[Koha] How to make the Koha/Zebra search ignore hyphens?

Thu Sep 26 13:51:12 NZST 2019

Hi Michael,

Your experience suggests to me that you're using Zebra 2.0.59 (which is the package available in the official Debian repositories). That version has a bug where a hyphen causes incorrect truncation, so that "Sinti-Swing" becomes "Sinti". If you use the Indexdata Debian repository and upgrade to the latest version of Zebra (or any version higher than 2.0.59, such as 2.0.60), you shouldn't have that problem anymore. (Debian doesn't have an active maintainer for idzebra-2.0 if I recall correctly, so it's never going to be fixed in Debian unless someone new steps forward. I've thought about doing it, but I have enough responsibilities already, and the workaround here is fairly trivial. That said, we as a community should probably do more with the Koha instructions to warn about this problem...)

I warned in my original email that you will also have to modify words-icu.xml and phrases-icu.xml to get the behaviour that you're wanting. You'll want to add a "transliterate" or "transform" rule before the "tokenize" rule to remove the hyphens. I don't know the exact rule you'll need, so you'll have to experiment a bit. You can read more about that at https://software.indexdata.com/yaz/doc/yaz-icu.html.

If you upgrade your Zebra and modify your ICU chain files, I think you should be able to achieve the behaviour you're wanting. 

Take the time to fully read the documentation at https://software.indexdata.com/yaz/doc/yaz-icu.html, as you can use yaz-icu to test the ICU configuration directly without having to reindex Zebra every time. Note that you may have to install yaz-icu separately, as I don't know whether it's installed by default on Debian when you install idzebra. (I rarely use Debian/Ubuntu for Koha, so my exact experiences can be a bit different.)

Actually, I'm going to do a little test myself.

Standard words-icu.xml:
echo "Sinti-Swing" | yaz-icu -c words-icu.xml
1 1 'sinti' 'Sinti'
2 1 'swing' 'Swing'
See that there are two separate tokens there. 

Using the following words-icu.xml with a transform rule before the tokenize rule:
<icu_chain locale="">
  <transliterate rule="{ œ > oe "/>
  <transliterate rule="{ Œ > oe "/>
  <transliterate rule="{ æ > ae "/>
  <transliterate rule="{ Æ > ae "/>
  <transliterate rule="\'>\ "/>
  <transliterate rule="\u2019>\ "/>
  <transliterate rule="\u02BC>\ "/>
  <transliterate rule="[:Number:] { '-' > '' "/>
  <!-- Remove control characters except \t\n\r -->
  <transform rule="[\x00-\x08\x0B\x0C\x0E-\x1F\x7F] Any-Remove"/>
  <transform rule="[-] Any-Remove"/>
  <tokenize rule="l"/>
  <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
  <transform rule="NFD"/>
  <transform rule="[:Nonspacing Mark:] Remove"/>
  <transform rule="NFC"/>
  <display/>
  <casemap rule="l"/>
</icu_chain>

echo "Sinti-Swing" | yaz-icu -c words-icu.xml
1 1 'sintiswing' 'SintiSwing'
echo "Sintiswing" | yaz-icu -c words-icu.xml
1 1 'sintiswing' 'Sintiswing'

Now you can see that there's just 1 token. 
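
When comparing several rule variants, counting tokens by eye gets tedious. As a rough sketch (in Python, and not part of yaz-icu itself), output lines like the ones shown above could be parsed as follows. The function name parse_yaz_icu_output is made up for illustration, and the sketch assumes every line has the shape seen above: two integers followed by two single-quoted strings. The meaning of the two integers is not interpreted here.

```python
import re

# Assumed line shape, taken from the examples above:
#   1 1 'sintiswing' 'SintiSwing'
LINE_RE = re.compile(r"^(\d+)\s+(\d+)\s+'(.*)'\s+'(.*)'$")


def parse_yaz_icu_output(text):
    """Return a list of (normalized, display) pairs, one per output line."""
    tokens = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # Groups 1 and 2 are the leading integers; they are ignored here.
            tokens.append((match.group(3), match.group(4)))
    return tokens


standard = "1 1 'sinti' 'Sinti'\n2 1 'swing' 'Swing'\n"
modified = "1 1 'sintiswing' 'SintiSwing'\n"

print(len(parse_yaz_icu_output(standard)))  # 2
print(len(parse_yaz_icu_output(modified)))  # 1
```

The text passed in would be whatever yaz-icu printed for a given chain file, pasted or captured however is convenient.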

If I were you, I'd experiment a bit, as I naively wrote that transform rule without thinking too much. There might be cases where you don't want to remove the hyphen. For example, a French search for "Mont-Royal" might want it to be normalized as "mont royal" and tokenized into "mont" and "royal", so that keyword searches for "mont" or "royal" will still match the record. 
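
To make that trade-off concrete, here is a simplified Python sketch. It is not ICU: it only lowercases and splits on whitespace, and both function names are hypothetical. It contrasts deleting the hyphen with replacing it by a space before tokenizing.

```python
def tokens_hyphen_removed(text):
    """Delete hyphens, then lowercase and split on whitespace (toy tokenizer)."""
    return text.replace("-", "").lower().split()


def tokens_hyphen_as_space(text):
    """Turn hyphens into spaces, then lowercase and split on whitespace."""
    return text.replace("-", " ").lower().split()


print(tokens_hyphen_removed("Sinti-Swing"))   # ['sintiswing']
print(tokens_hyphen_removed("Mont-Royal"))    # ['montroyal']
print(tokens_hyphen_as_space("Mont-Royal"))   # ['mont', 'royal']
```

With the first approach a keyword search for "royal" alone would no longer match "Mont-Royal"; with the second, "Sintiswing" and "Sinti-Swing" would not normalize to the same token.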

Note that the transliterate rules are very powerful. For example, you could replace that transform rule I added with one of the following:
<transliterate rule="([a-zA-Z]+) { '-' } ([a-zA-Z]+) > '' " />
<transliterate rule="([a-zA-Z]+)'-'([a-zA-Z]+) > $1$2" />

echo "Sinti-Swing" | yaz-icu -c words-icu.xml
1 1 'sintiswing' 'SintiSwing'

Take a look at <transliterate rule="[:Number:] { '-' > '' "/> which already exists to remove hyphens when they follow a number. 
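
For experimenting with which hyphens a narrower rule would touch, the two ideas above can be approximated with Python regular expressions. This is only an approximation under stated assumptions: Python's re module is not the ICU transliterator, \d stands in loosely for [:Number:], and the function names are invented for this sketch.

```python
import re


def remove_hyphen_between_letters(text):
    """Drop a hyphen only when it sits between two ASCII letters.

    Loosely mirrors the ([a-zA-Z]+)'-'([a-zA-Z]+) > $1$2 idea above.
    """
    return re.sub(r"(?<=[A-Za-z])-(?=[A-Za-z])", "", text)


def remove_hyphen_after_digit(text):
    """Drop a hyphen only when it directly follows a digit.

    Loosely mirrors the existing [:Number:] { '-' > '' rule.
    """
    return re.sub(r"(?<=\d)-", "", text)


print(remove_hyphen_between_letters("Sinti-Swing"))  # SintiSwing
print(remove_hyphen_between_letters("978-3"))        # 978-3
print(remove_hyphen_after_digit("978-3-16"))         # 978316
```

Note that [a-zA-Z] covers ASCII letters only, so a hyphen next to a letter such as "Ä" or "é" would be left alone by the letter-based variant. That may or may not be what is wanted for German.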

What I'm trying to say is that the ICU rules are very powerful, but you have to be careful with how you use them. While it's trivial to fix the Sinti-Swing example, creating that "fix" might actually "break" something else. I think it comes down to trade-offs, and that's something that you'll have to think about as you're configuring your ICU rules.

Remember that this file is used both at index time *and* search time (as far as I know). Rules that might make sense at index time might not make sense at search time. I'm not familiar with hyphen usage in German, so I wouldn't really know what would make sense. 
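
The effect of applying one set of rules on both sides can be illustrated with a toy inverted index in Python. Everything here is an assumption made for illustration: the normalize function is a crude stand-in for the ICU chain, the record titles are invented, and combining query tokens with AND is a simplification rather than a description of how Zebra evaluates queries.

```python
def normalize(text):
    """Toy stand-in for the ICU chain: drop hyphens, lowercase, split on whitespace."""
    return text.replace("-", "").lower().split()


def build_index(records):
    """Map each normalized token to the set of record ids that contain it."""
    index = {}
    for record_id, title in records.items():
        for token in normalize(title):
            index.setdefault(token, set()).add(record_id)
    return index


def search(index, query):
    """Normalize the query the same way, then intersect the per-token hits."""
    result = None
    for token in normalize(query):
        hits = index.get(token, set())
        result = hits if result is None else result & hits
    return result or set()


# Invented sample records, keyed by a made-up record id.
records = {1: "Sinti-Swing in Berlin", 2: "Sintiswing heute"}
index = build_index(records)

print(sorted(search(index, "Sinti-Swing")))  # [1, 2]
print(sorted(search(index, "Sintiswing")))   # [1, 2]
```

Because the query goes through the same normalization as the indexed text, both spellings land on the same token. A rule that only made sense for one side would break that symmetry.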

Anyway, I hope that's more helpful!

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Australia

Office: 02 9212 0899
Direct: 02 8005 0595

-----Original Message-----
From: Michael Kuhn <mik at adminkuhn.ch> 
Sent: Thursday, 26 September 2019 4:59 AM
To: dcook at prosentient.com.au; koha at lists.katipo.co.nz
Subject: Re: [Koha] How to make the Koha/Zebra search ignore hyphens?

Hi David

> I'm glad that I got you a bit further on your journey. It's a shame about having to use the CHR indexing. You can find more information here at https://software.indexdata.com/zebra/doc/character-map-files.html.
>
> After reading through that, I'm thinking perhaps that CHR indexing can't help you.

Thanks for your assessment!

> You could ask Indexdata for more information, but I'm guessing it can't be done with CHR. It should be doable with ICU though.

So I tried to switch from the Koha standard CHR indexing to ICU according to https://wiki.koha-community.org/wiki/ICU_chains_configuration, just using the original configuration of "words-icu.xml" and "phrases-icu.xml", then restarting Zebra and reindexing. But I'm getting a very unexpected result. Now a catalog search

* for "Sintiswing" shows 1 hit

* for "Sinti-Swing" shows 4,222 hits; the hyphen seems to be ignored completely, and everything is found that contains either "Sinti" OR "Swing" or both

* for "Sinti Swing" shows 18 hits; the hyphen is used as a breaking character, so any record containing "Sinti-Swing" or "Sinti" AND "Swing" is found, but not "Sintiswing"

In short: The Koha standard configuration of ICU ("words-icu.xml" and "phrases-icu.xml") seems defective to me. The results are much worse than what CHR gives. And of course the desired result isn't there yet anyway.

Do you maybe have a hint where to find some documentation about how to change the behaviour of ICU indexing in the desired way?

Best wishes: Michael
--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz T 0041 (0)61 261 55 61 · E mik at adminkuhn.ch · W www.adminkuhn.ch

On 25.09.19 at 08:34, dcook at prosentient.com.au wrote:
> Hi Michael,
> 
> I'm glad that I got you a bit further on your journey. It's a shame about having to use the CHR indexing. You can find more information here at https://software.indexdata.com/zebra/doc/character-map-files.html.
> 
> After reading through that, I'm thinking perhaps that CHR indexing can't help you.
> 
> You could ask Indexdata for more information, but I'm guessing it can't be done with CHR. It should be doable with ICU though.
> 
> 
> -----Original Message-----
> From: Michael Kuhn <mik at adminkuhn.ch>
> Sent: Wednesday, 25 September 2019 4:47 AM
> To: dcook at prosentient.com.au; koha at lists.katipo.co.nz
> Subject: Re: [Koha] How to make the Koha/Zebra search ignore hyphens?
> 
> Hi David
> 
> Many thanks for your reply and the hints!
> 
> After a standard installation of Koha 18.11 the CHR indexing is used, thus the configuration is done in file "word-phrase-utf.chr".
> 
> A catalog search
> * for "Sintiswing" shows 1 hit
> * for "Sinti-Swing" shows 18 hits; the hyphen is used as a breaking character, so any record containing "Sinti-Swing" or "Sinti" and "Swing" is found, but not "Sintiswing"
> 
> I changed the following line, omitting the hyphen (between comma and dot):
> 
> space
> {\001-\040}!"#$%&'\()*+,./:;<=>?@\[\\]^_`\{|}~’{\x88-\x89}{\x98-\x9C}¡¿«»
> 
> After a Zebra reindexing a catalog search
> * for "Sintiswing" shows 1 hit
> * for "Sinti-Swing" now shows only 8 hits; the hyphen is no longer used as a breaking character, so any record containing "Sinti Swing" or "Sinti-Swing" is found, but not "Sintiswing"
> 
> I also tried to add "map (-) @" but this leads to the original results.
> 
> In short: My change of configuration didn't lead to the desired result... A search for "Sintiswing" should also find "Sinti-Swing", and vice versa. This is not the case.
> 
> Since I couldn't find any documentation about CHR indexing - does anyone know where to find out more about the CHR way of indexing?
> 
> Best wishes: Michael
> --
> Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz T 0041 (0)61 261 55 61 · E mik at adminkuhn.ch · W www.adminkuhn.ch
> 
> 
> 
> On 19.09.19 at 03:29, dcook at prosentient.com.au wrote:
>> Hi Michael,
>>
>> That's really interesting. I assume that you're using ICU indexing?
>>
>> You could update "phrases-icu.xml" and "words-icu.xml" to strip out hyphens. You would need to re-index all your records afterwards though.
>>
>> I haven't actually tested that particular change, but just taking a little look with both ICU and CHR and it looks like hyphens are used to tokenize. Currently, when you search "Tee-Ei", you're actually searching for "Tee" and "Ei".
>>
>> If you're using ICU, you could add a transform rule before the tokenize rule to remove the hyphen. This would prevent it from tokenizing and then "Tee-Ei" and "Teeei" should retrieve the same records.
>>
>> Beware also that this is a universal change. You might want to check to see if there are hyphens that shouldn't be removed. If so, you may need to make a more complex rule to try to just capture the desired cases.
>>
>> If you're using CHR, you can take a look at word-phrase-utf.chr and remove - from the "Breaking characters" section. You may or may not also need to map it. I'm less familiar with CHR indexing.
>>
>> Anyway, I hope that helps.
>>
>>
>> -----Original Message-----
>>
>> Date: Wed, 18 Sep 2019 22:46:15 +0200
>> From: 	
>> To: "Koha : access" <koha at lists.katipo.co.nz>
>> Subject: [Koha] How to make the Koha/Zebra search ignore hyphens?
>> Message-ID: <5b63f3b4-76c1-c1f8-f35a-6a33e3b0afa5 at adminkuhn.ch>
>> Content-Type: text/plain; charset=utf-8; format=flowed
>>
>> Hi
>>
>> We have found that, at least in German, there are words or combinations
>> of words that can be written in different ways, where both are correct
>> and mean the same, e.g.
>>
>> * Ultraschallmessgerät = Ultraschall-Messgerät
>> * Sintiswing = Sinti-Swing
>> * Teeei = Tee-Ei
>> * Haftpflichtversicherungsgesellschaft =
>> Haftpflicht-Versicherungsgesellschaft
>>
>> This is a general concept in German, so it makes no sense to add a "used
>> for/see from:" in the authority data. Anyway, such words can exist
>> everywhere in the bibliographic record, not only in fields linked to
>> authority fields.
>>
>> Now the question: is there a way to teach Koha (or Zebra) to look
>> for the second term also when the first term is searched, and vice
>> versa? Or shorter: just to ignore the hyphens? Using the standard
>> configuration Koha will not find the second term if the first one is
>> searched, and vice versa.
>>
>> We would appreciate any hint or tip!
>>
>> Best wishes: Michael
>>
> 
> 
> 
