[Koha] How to make the Koha/Zebra search ignore hyphens?

Michael Kuhn mik at adminkuhn.ch
Thu Sep 26 06:59:15 NZST 2019


Hi David

 > I'm glad that I got you a bit further on your journey. It's a shame
 > about having to use the CHR indexing. You can find more information
 > here at
 > https://software.indexdata.com/zebra/doc/character-map-files.html.
 >
 > After reading through that, I'm thinking perhaps that CHR indexing
 > can't help you.

Thanks for your assessment!

 > You could ask Indexdata for more information, but I'm guessing it
 > can't be done with CHR. It should be doable with ICU though.

So I tried switching from the Koha standard CHR indexing to ICU, following 
https://wiki.koha-community.org/wiki/ICU_chains_configuration. I used 
the original configuration of "words-icu.xml" and "phrases-icu.xml" 
unchanged, then restarted Zebra and reindexed. The result is very 
unexpected. Now a catalog search

* for "Sintiswing" shows 1 hit

* for "Sinti-Swing" shows 4,222 hits; the hyphen seems to be ignored 
completely and every record is found that contains "Sinti" OR "Swing" 
or both

* for "Sinti Swing" shows 18 hits; the hyphen is used as a breaking 
character, so any record containing "Sinti-Swing" or "Sinti" AND "Swing" 
is found, but not "Sintiswing"

In short: The Koha standard configuration of ICU ("words-icu.xml" and 
"phrases-icu.xml") seems defective to me. The results are much worse 
than what CHR gives. And of course the desired result isn't there yet 
anyway.
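
The difference between the current and the desired behaviour can be modelled in a few lines of Python. This is only an illustrative sketch, not Zebra or ICU code: the function names are hypothetical, and real Zebra indexing does much more than this. It assumes that "hyphen as breaking character" means splitting on hyphens and whitespace, and that "ignoring hyphens" means deleting them before any splitting happens.

```python
import re


def tokens_split_on_hyphen(text):
    """Model of the observed behaviour: the hyphen acts as a word break."""
    return [t.lower() for t in re.split(r"[\s\-]+", text) if t]


def tokens_hyphen_removed(text):
    """Model of the desired behaviour: delete hyphens first, then split on whitespace."""
    return [t.lower() for t in text.replace("-", "").split()]


for analyse in (tokens_split_on_hyphen, tokens_hyphen_removed):
    hyphenated = analyse("Sinti-Swing")
    joined = analyse("Sintiswing")
    # The two spellings can only match each other if they yield the same tokens.
    print(analyse.__name__, hyphenated, joined, hyphenated == joined)
```

In this model only the second function makes "Sinti-Swing" and "Sintiswing" produce the same token. Note that "Sinti Swing" (with a space) would still yield two tokens, so it would not match "Sintiswing" either way.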

Do you have a hint as to where I can find documentation on how to 
change the behaviour of ICU indexing in the desired way?

Best wishes: Michael
-- 
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik at adminkuhn.ch · W www.adminkuhn.ch





On 25.09.19 at 08:34, dcook at prosentient.com.au wrote:
> Hi Michael,
> 
> I'm glad that I got you a bit further on your journey. It's a shame about having to use the CHR indexing. You can find more information here at https://software.indexdata.com/zebra/doc/character-map-files.html.
> 
> After reading through that, I'm thinking perhaps that CHR indexing can't help you.
> 
> You could ask Indexdata for more information, but I'm guessing it can't be done with CHR. It should be doable with ICU though.
> 
> David Cook
> Systems Librarian
> Prosentient Systems
> 72/330 Wattle St
> Ultimo, NSW 2007
> Australia
> 
> Office: 02 9212 0899
> Direct: 02 8005 0595
> 
> -----Original Message-----
> From: Michael Kuhn <mik at adminkuhn.ch>
> Sent: Wednesday, 25 September 2019 4:47 AM
> To: dcook at prosentient.com.au; koha at lists.katipo.co.nz
> Subject: Re: [Koha] How to make the Koha/Zebra search ignore hyphens?
> 
> Hi David
> 
> Many thanks for your reply and the hints!
> 
> After a standard installation of Koha 18.11 the CHR indexing is used, thus the configuration is done in file "word-phrase-utf.chr".
> 
> A catalog search
> * for "Sintiswing" shows 1 hit
> * for "Sinti-Swing" shows 18 hits, the hyphen is used as a breaking character, so any record containing "Sinti-Swing" or "Sinti" and "Swing"
> is found, but not "Sintiswing"
> 
> I changed the following line, omitting the hyphen (between comma and dot):
> 
> space
> {\001-\040}!"#$%&'\()*+,./:;<=>?@\[\\]^_`\{|}~’{\x88-\x89}{\x98-\x9C}¡¿«»
> 
> After a Zebra reindexing a catalog search
> * for "Sintiswing" shows 1 hit
> * for "Sinti-Swing" now shows only 8 hits, the hyphen is no longer used as a breaking character, so any record containing "Sinti Swing" or "Sinti-Swing" is found, but not "Sintiswing"
> 
> I also tried to add "map (-) @" but this leads to the original results.
> 
> In short: My change of configuration didn't lead to the desired result... If searching for "Sintiswing" also "Sinti-Swing" should be found, and vice versa. This is not the case.
> 
> Since I couldn't find any documentation about CHR indexing - does anyone know where to find out more about the CHR way of indexing?
> 
> Best wishes: Michael
> --
> Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz T 0041 (0)61 261 55 61 · E mik at adminkuhn.ch · W www.adminkuhn.ch
> 
> 
> 
> On 19.09.19 at 03:29, dcook at prosentient.com.au wrote:
>> Hi Michael,
>>
>> That's really interesting. I assume that you're using ICU indexing?
>>
>> You could update "phrases-icu.xml" and "words-icu.xml" to strip out hyphens. You would need to re-index all your records afterwards though.
>>
>> I haven't actually tested that particular change, but from a quick look at both ICU and CHR, it looks like hyphens are used to tokenize. Currently, when you search "Tee-Ei", you're actually searching for "Tee" and "Ei".
>>
>> If you're using ICU, you could add a transform rule before the tokenize rule to remove the hyphen. This would prevent it from tokenizing and then "Tee-Ei" and "Teeei" should retrieve the same records.
>>
>> Beware also that this is a universal change. You might want to check to see if there are hyphens that shouldn't be removed. If so, you may need to make a more complex rule to try to just capture the desired cases.
>>
>> If you're using CHR, you can take a look at word-phrase-utf.chr and remove - from the "Breaking characters" section. You may or may not also need to map it. I'm less familiar with CHR indexing.
>>
>> Anyway, I hope that helps.
>>
>> David Cook
>> Systems Librarian
>> Prosentient Systems
>> 72/330 Wattle St
>> Ultimo, NSW 2007
>> Australia
>>
>> Office: 02 9212 0899
>> Direct: 02 8005 0595
>>
>> -----Original Message-----
>>
>> Date: Wed, 18 Sep 2019 22:46:15 +0200
>> From: 	
>> To: "Koha : access" <koha at lists.katipo.co.nz>
>> Subject: [Koha] How to make the Koha/Zebra search ignore hyphens?
>> Message-ID: <5b63f3b4-76c1-c1f8-f35a-6a33e3b0afa5 at adminkuhn.ch>
>> Content-Type: text/plain; charset=utf-8; format=flowed
>>
>> Hi
>>
>> We have found that, at least in German, there are words or combinations
>> of words that can be written in different ways, where both are correct
>> and mean the same thing, e.g.
>>
>> * Ultraschallmessgerät = Ultraschall-Messgerät
>> * Sintiswing = Sinti-Swing
>> * Teeei = Tee-Ei
>> * Haftpflichtversicherungsgesellschaft =
>> Haftpflicht-Versicherungsgesellschaft
>>
>> This is a general concept in German, so it makes no sense to add a "used
>> for/see from:" in the authority data. Anyway, such words can exist
>> everywhere in the bibliographic record, not only in fields linked to
>> authority fields.
>>
>> Now the question: is there a way to teach Koha (or Zebra) to look
>> for the second term also when the first term is searched, and vice
>> versa? Or shorter: Just to ignore the hyphens? Using the standard
>> configuration Koha will not find the second term if the first one is
>> searched, and vice versa.
>>
>> We would appreciate any hint or tip!
>>
>> Best wishes: Michael
>>
> 
> 
> 



