[Koha] How to make the Koha/Zebra search ignore hyphens?

Michael Kuhn mik at adminkuhn.ch
Wed Sep 25 06:47:03 NZST 2019


Hi David

Many thanks for your reply and the hints!

A standard installation of Koha 18.11 uses CHR indexing, so the 
configuration is done in the file "word-phrase-utf.chr".

A catalog search
* for "Sintiswing" shows 1 hit
* for "Sinti-Swing" shows 18 hits: the hyphen is treated as a breaking 
character, so any record containing "Sinti-Swing", or both "Sinti" and 
"Swing", is found, but not "Sintiswing"

I changed the following line, omitting the hyphen (between comma and dot):

space {\001-\040}!"#$%&'\()*+,./:;<=>?@\[\\]^_`\{|}~’{\x88-\x89}{\x98-\x9C}¡¿«»

After reindexing Zebra, a catalog search
* for "Sintiswing" shows 1 hit
* for "Sinti-Swing" now shows only 8 hits: the hyphen is no longer used as 
a breaking character, so any record containing "Sinti Swing" or 
"Sinti-Swing" is found, but not "Sintiswing"

I also tried adding "map (-) @", but this gave the same results as the 
original configuration.

In short: my configuration change didn't produce the desired result. A 
search for "Sintiswing" should also find "Sinti-Swing", and vice versa. 
This is not the case.
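
To make the goal concrete, here is a minimal Python sketch of the 
behaviour I am after. It is only a model of the two strategies, not how 
Zebra works internally, and the function names are made up for 
illustration. Splitting on the hyphen yields two tokens, while deleting 
the hyphen before indexing gives both spellings the same key.

```python
import re


def tokens_breaking(text: str) -> list[str]:
    """Model of the hyphen as a breaking character: split on it."""
    return [t.lower() for t in re.split(r"[\s\-]+", text) if t]


def normalize(term: str) -> str:
    """Delete hyphens and lowercase, so spelling variants share one key."""
    return term.replace("-", "").lower()


def tokens_ignoring(text: str) -> list[str]:
    """Model of the desired behaviour: split on whitespace only,
    then drop hyphens inside each token."""
    return [normalize(t) for t in text.split() if normalize(t)]


if __name__ == "__main__":
    for term in ("Sintiswing", "Sinti-Swing", "Tee-Ei", "Teeei"):
        print(term, tokens_breaking(term), tokens_ignoring(term))
```

With the second strategy "Sinti-Swing" and "Sintiswing" end up as the 
same index term, which is the result I would like to get from Zebra.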

I couldn't find any documentation about CHR indexing. Does anyone know 
where to find out more about it?

Best wishes: Michael
-- 
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik at adminkuhn.ch · W www.adminkuhn.ch



On 19.09.19 at 03:29, dcook at prosentient.com.au wrote:
> Hi Michael,
> 
> That's really interesting. I assume that you're using ICU indexing?
> 
> You could update "phrases-icu.xml" and "words-icu.xml" to strip out hyphens. You would need to re-index all your records afterwards though.
> 
> I haven't actually tested that particular change, but from a quick look at both ICU and CHR, it looks like hyphens are used to tokenize. Currently, when you search "Tee-Ei", you're actually searching for "Tee" and "Ei".
> 
> If you're using ICU, you could add a transform rule before the tokenize rule to remove the hyphen. This would prevent it from tokenizing and then "Tee-Ei" and "Teeei" should retrieve the same records.
> 
> Beware also that this is a universal change. You might want to check to see if there are hyphens that shouldn't be removed. If so, you may need to make a more complex rule to try to just capture the desired cases.
> 
> If you're using CHR, you can take a look at word-phrase-utf.chr and remove - from the "Breaking characters" section. You may or may not also need to map it. I'm less familiar with CHR indexing.
> 
> Anyway, I hope that helps.
> 
> David Cook
> Systems Librarian
> Prosentient Systems
> 72/330 Wattle St
> Ultimo, NSW 2007
> Australia
> 
> Office: 02 9212 0899
> Direct: 02 8005 0595
> 
> -----Original Message-----
> 
> Date: Wed, 18 Sep 2019 22:46:15 +0200
> From: 	
> To: "Koha : access" <koha at lists.katipo.co.nz>
> Subject: [Koha] How to make the Koha/Zebra search ignore hyphens?
> Message-ID: <5b63f3b4-76c1-c1f8-f35a-6a33e3b0afa5 at adminkuhn.ch>
> Content-Type: text/plain; charset=utf-8; format=flowed
> 
> Hi
> 
> We have found that, at least in German, there are words or combinations
> of words that can be written in different ways, where both are correct
> and mean the same thing, e.g.
> 
> * Ultraschallmessgerät = Ultraschall-Messgerät
> * Sintiswing = Sinti-Swing
> * Teeei = Tee-Ei
> * Haftpflichtversicherungsgesellschaft =
> Haftpflicht-Versicherungsgesellschaft
> 
> This is a general concept in German, so it makes no sense to add a "used
> for/see from:" in the authority data. Anyway, such words can exist
> everywhere in the bibliographic record, not only in fields linked to
> authority fields.
> 
> Now the question: is there a way to teach Koha (or Zebra) to look
> for the second term also when the first term is searched, and vice
> versa? Or shorter: just to ignore the hyphens? Using the standard
> configuration, Koha will not find the second term if the first one is
> searched, and vice versa.
> 
> We would appreciate any hint or tip!
> 
> Best wishes: Michael
> 



