Hi David

Many thanks for your reply and the hints!

After a standard installation of Koha 18.11, CHR indexing is used, so the configuration is done in the file "word-phrase-utf.chr".

A catalog search
* for "Sintiswing" shows 1 hit
* for "Sinti-Swing" shows 18 hits: the hyphen is used as a breaking character, so any record containing "Sinti-Swing" or "Sinti" and "Swing" is found, but not "Sintiswing"

I changed the following line, omitting the hyphen (between comma and dot):

space {\001-\040}!"#$%&'\()*+,./:;<=>?@\[\\]^_`\{|}~’{\x88-\x89}{\x98-\x9C}¡¿«»

After a Zebra reindexing, a catalog search
* for "Sintiswing" shows 1 hit
* for "Sinti-Swing" now shows only 8 hits: the hyphen is no longer used as a breaking character, so any record containing "Sinti Swing" or "Sinti-Swing" is found, but not "Sintiswing"

I also tried adding "map (-) @", but this leads back to the original results.

In short: my change of configuration didn't lead to the desired result. A search for "Sintiswing" should also find "Sinti-Swing", and vice versa. This is not the case.

Since I couldn't find any documentation about CHR indexing: does anyone know where to find out more about the CHR way of indexing?

Best wishes: Michael

--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch

On 19.09.19 at 03:29, dcook@prosentient.com.au wrote:
Hi Michael,
That's really interesting. I assume that you're using ICU indexing?
You could update "phrases-icu.xml" and "words-icu.xml" to strip out hyphens. You would need to re-index all your records afterwards though.
I haven't actually tested that particular change, but from a quick look at both ICU and CHR, it looks like hyphens are used to tokenize. Currently, when you search for "Tee-Ei", you're actually searching for "Tee" and "Ei".
If you're using ICU, you could add a transform rule before the tokenize rule to remove the hyphen. This would prevent it from tokenizing and then "Tee-Ei" and "Teeei" should retrieve the same records.
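To illustrate why the order of the two steps matters, here is a minimal Python sketch. It only simulates the idea and is not Zebra or ICU configuration syntax; the function names and the simple word-splitting regex are assumptions made for this example.

```python
import re

def tokenize_breaking_on_hyphen(text):
    """Simulates the current behaviour: the hyphen acts as a token boundary."""
    # \w+ matches runs of letters/digits; the hyphen is not part of \w,
    # so "Tee-Ei" is split into two tokens.
    return re.findall(r"\w+", text.lower())

def tokenize_after_removing_hyphen(text):
    """Simulates a 'remove hyphen' transform applied BEFORE tokenizing."""
    # With the hyphen stripped first, the tokenizer sees one unbroken word.
    return re.findall(r"\w+", text.replace("-", "").lower())

# Current behaviour: the two spellings produce different tokens.
print(tokenize_breaking_on_hyphen("Tee-Ei"))      # ['tee', 'ei']
print(tokenize_breaking_on_hyphen("Teeei"))       # ['teeei']

# Hyphen removed first: both spellings produce the same token.
print(tokenize_after_removing_hyphen("Tee-Ei"))   # ['teeei']
print(tokenize_after_removing_hyphen("Teeei"))    # ['teeei']
```

Since the same normalization would apply to both indexed records and search terms, the two spellings should then match each other, which is the effect the transform rule is meant to achieve.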
Beware also that this is a universal change. You might want to check to see if there are hyphens that shouldn't be removed. If so, you may need to make a more complex rule to try to just capture the desired cases.
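As a rough idea of what a "more complex rule" could look like, the following Python sketch removes a hyphen only when it sits directly between two letters. This is an illustration of the logic only, not an ICU or CHR rule; the function name and the sample strings are made up for this example.

```python
import re

# A hyphen with a letter immediately before and immediately after it.
# [^\W\d_] matches any Unicode letter (word character minus digits and underscore).
HYPHEN_BETWEEN_LETTERS = re.compile(r"(?<=[^\W\d_])-(?=[^\W\d_])")

def remove_compound_hyphens(text):
    """Drops hyphens inside compound words, leaves all other hyphens alone."""
    return HYPHEN_BETWEEN_LETTERS.sub("", text)

print(remove_compound_hyphens("Sinti-Swing"))         # SintiSwing
print(remove_compound_hyphens("1990-1995"))           # 1990-1995 (unchanged)
print(remove_compound_hyphens("Vor- und Nachteile"))  # Vor- und Nachteile (unchanged)
```

Hyphens in number ranges and trailing hyphens are left intact here, so only the compound-word case is affected. Whether that is the right boundary for a given catalog would still need checking against real records.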
If you're using CHR, you can take a look at word-phrase-utf.chr and remove - from the "Breaking characters" section. You may or may not also need to map it. I'm less familiar with CHR indexing.
Anyway, I hope that helps.
David Cook Systems Librarian Prosentient Systems 72/330 Wattle St Ultimo, NSW 2007 Australia
Office: 02 9212 0899 Direct: 02 8005 0595
-----Original Message-----
Date: Wed, 18 Sep 2019 22:46:15 +0200
From:
To: "Koha : access" <koha@lists.katipo.co.nz>
Subject: [Koha] How to make the Koha/Zebra search ignore hyphens?
Message-ID: <5b63f3b4-76c1-c1f8-f35a-6a33e3b0afa5@adminkuhn.ch>
Content-Type: text/plain; charset=utf-8; format=flowed
Hi
We have found that, at least in German, there are words or combinations of words that can be written in different ways, where both spellings are correct and mean the same thing, e.g.
* Ultraschallmessgerät = Ultraschall-Messgerät
* Sintiswing = Sinti-Swing
* Teeei = Tee-Ei
* Haftpflichtversicherungsgesellschaft = Haftpflicht-Versicherungsgesellschaft
This is a general concept in German, so it makes no sense to add a "used for/see from:" in the authority data. Anyway, such words can exist everywhere in the bibliographic record, not only in fields linked to authority fields.
Now the question: is there a way to teach Koha (or Zebra) to look for the second term also when the first term is searched, and vice versa? Or shorter: just to ignore the hyphens? Using the standard configuration, Koha will not find the second term if the first one is searched, and vice versa.
We would appreciate any hint or tip!
Best wishes: Michael