[Koha] [Koha-devel] Problems searching callnumbers with Koha and ICU

dcook at prosentient.com.au dcook at prosentient.com.au
Wed Jul 10 19:33:41 NZST 2019


Hi again, Mason,

I just remembered a little trick that you might find useful.

Try the following: 
echo "PZ 7 .W663 1984" | yaz-icu -x -c /path/to/phrases-icu.xml
echo "PZ 7 .W663 1984" | yaz-icu -x -c /path/to/words-icu.xml

That should show you how the string is normalized and tokenized for indexing
with ICU. 

You should see the same thing when you're using yaz-client, but this can be
a bit more convenient I reckon.

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Australia

Office: 02 9212 0899
Direct: 02 8005 0595

-----Original Message-----
From: koha-devel-bounces at lists.koha-community.org
<koha-devel-bounces at lists.koha-community.org> On Behalf Of
dcook at prosentient.com.au
Sent: Wednesday, 10 July 2019 5:05 PM
To: 'Mason James' <mtj at kohaaloha.com>; koha at lists.katipo.co.nz;
koha-devel at lists.koha-community.org
Subject: Re: [Koha-devel] Problems searching callnumbers with Koha and ICU

Hi Mason,

Can you tell us what version of Zebra you're running? And what is your exact
query? 

According to https://packages.debian.org/stretch/idzebra-2.0, you're
probably running Zebra 2.0.59, unless you're pulling packages from
Indexdata's APT repository. 

I discovered an ICU bug in Zebra 2.0.59 back in February 2015, which could
very well be impacting you now. At the time, I thought it was just an issue
when hyphens were used in search terms, but I've had the same problem with
spaces lately when using "se,phr,ext" (which uses the phrase register rather
than the word register) with Zebra 2.0.59 on Debian. 

I think most people using ICU are using Zebra from Indexdata's APT
repositories. I had an issue with that recently but I'm going to revisit it
soon. 

I have a few other ICU-related questions that I have asked Indexdata, but so
far I haven't heard back. They're mostly about how normalization and
tokenization are done at search time vs index time, as I don't think the
documentation is clear about that. 

(For instance, https://software.indexdata.com/zebra/doc/icuchain-files.html
says "The ICU chain files defines a chain of rules which specify the
conversion process to be carried out for each record string for indexing.
Both searching and sorting is based on the sort normalization that ICU
provides. This means that scan and sort will return terms in the sort order
given by ICU." To me, that sounds like different rules are used for indexing
and for searching/sorting, which is consistent with my testing. I think
search/sort uses default ICU settings while indexing uses custom settings.
We replace apostrophes with a space when indexing in the word register, but
search replaces apostrophes with nothing, so the tokens don't match up. But
I digress...)
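
To make the suspected apostrophe mismatch concrete, here is a toy sketch in
plain shell. It does not call Zebra or ICU at all: index_tokens and
search_tokens are hypothetical helper names, and the two rules are only my
reading of the behaviour described above (apostrophe to space at index time,
apostrophe dropped at search time), not confirmed Zebra behaviour.

```shell
#!/bin/sh
# Toy model only. These helpers are hypothetical stand-ins for the
# index-time and search-time rules; they are NOT Zebra's or ICU's real code.

# Assumed index-time rule: an apostrophe is replaced with a space.
index_tokens() {
    printf '%s\n' "$1" | tr "'" ' '
}

# Assumed search-time rule: an apostrophe is removed outright.
search_tokens() {
    printf '%s\n' "$1" | tr -d "'"
}

term="O'Reilly"
echo "indexed as:  $(index_tokens "$term")"
echo "searched as: $(search_tokens "$term")"
```

If the two rules really do differ like this, the index would hold two tokens
("O" and "Reilly") while the search would look for one ("OReilly"), so the
lookup would miss. Running the real term through yaz-icu with the actual
chain files is the way to check what happens at index time.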

Relevant reading:
1. Look for ZEB-664 in https://software.indexdata.com/zebra/doc/NEWS
2. Robin opened a bug report in Debian but it never went anywhere:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=777515;msg=5
3. https://github.com/indexdata/idzebra/commit/704fd190292cb771df94553b0ed6f9f4b71660a6
4. https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=16581


-----Original Message-----
From: koha-devel-bounces at lists.koha-community.org
<koha-devel-bounces at lists.koha-community.org> On Behalf Of Mason James
Sent: Wednesday, 10 July 2019 3:54 PM
To: koha at lists.katipo.co.nz; koha-devel at lists.koha-community.org
Subject: [Koha-devel] Problems searching callnumbers with Koha and ICU

Hi Folks,

Has anyone hit a problem searching callnumbers with Koha and ICU,
specifically callnumbers containing SPACE ' ' characters?
An example of a problematic callnumber is 'PZ 7 .W663 1984'.

Or, has anyone had *success* searching callnumbers with ICU? :) Either way,
I'd be curious to hear from you.

I tested on Koha 18.05.12 and Debian 9.8


Cheers, Mason

