Re: [Koha] How to make the Koha/Zebra search ignore hyphens?

26 Sep 2019

Hi Katrin,

I actually did create a transliterate rule that converted the input "Sinti-Swing" to "SintiSwing Sinti Swing", which produced "sintiswing", "sinti", and "swing" as tokens. However, I think the ICU chain file gets used for both indexing and searching, so while the rule would work for indexing, it wouldn't work for searching, as the search would return irrelevant hits.
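As a rough illustration of what that rule does to the input, here is a minimal sketch in plain Python. It is not the ICU transliterate rule itself; the function names expand_hyphenated and tokenize are hypothetical, and the regex only imitates the effect described above.

```python
import re


def expand_hyphenated(text):
    """Rewrite each hyphenated word as its joined form followed by its parts.

    Hypothetical helper: it imitates the effect of the transliterate rule,
    not its actual ICU syntax.
    """
    def _expand(match):
        parts = match.group(0).split("-")
        return " ".join(["".join(parts)] + parts)

    # Match runs of word characters joined by single hyphens.
    return re.sub(r"\w+(?:-\w+)+", _expand, text)


def tokenize(text):
    """Split on non-word characters and lowercase, as a stand-in for the chain."""
    return [token.lower() for token in re.findall(r"\w+", text)]


expanded = expand_hyphenated("Sinti-Swing")
tokens = tokenize(expanded)
print(expanded)
print(tokens)
```

Because the same function would run on the query string as well, a search for "Sinti-Swing" would be expanded in exactly the same way as the indexed data.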

The only other thing I can think of is perhaps modifying biblio-zebra-indexdefs.xsl to send the input data twice: once with the hyphen and once without it. I think that would be hugely laborious, though. (I just noticed we have a chopPunctuation template in biblio-zebra-indexdefs.xsl, which I don't think actually gets used.)

I did notice something interesting today when I was looking at ICU: http://userguide.icu-project.org/boundaryanalysis. Observe the following:

Line break:
|Parlez-|vous |français ?|

Word break:
|Parlez|-|vous| |français| |?|

At the moment, we use line break in ICU. I suppose there isn't a huge difference between the two. But I thought it was interesting. I hadn't really thought about it before. 
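To make the difference between the two boundary types concrete, here is a rough approximation in plain Python. It does not call ICU; the two functions are hypothetical and their rules are simplified assumptions, chosen only so that they reproduce the "Parlez-vous français ?" example above.

```python
import re


def word_break_segments(text):
    """Approximate word boundaries: runs of word characters, and every
    other character as a segment of its own."""
    return re.findall(r"\w+|\W", text)


def line_break_segments(text):
    """Approximate line-break opportunities: after a hyphen, and after
    whitespace unless the next character is closing punctuation."""
    segments = []
    start = 0
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        after_hyphen = prev == "-" and not cur.isspace()
        after_space = prev.isspace() and not cur.isspace() and cur not in "?!;:"
        if after_hyphen or after_space:
            segments.append(text[start:i])
            start = i
    segments.append(text[start:])
    return segments


sample = "Parlez-vous français ?"
word_segments = word_break_segments(sample)
line_segments = line_break_segments(sample)
print("|" + "|".join(line_segments) + "|")
print("|" + "|".join(word_segments) + "|")
```

In both cases the hyphen ends up separating "Parlez" from "vous", which fits the impression that the choice between the two makes little difference for hyphenated words.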

It feels like there should be a way of having "Mont-Royal" indexed as "MontRoyal" as well as "Mont" and "Royal". Currently, it is indexed only as "Mont" and "Royal", so retrieving just the relevant "Mont-Royal" records requires an exact-match phrase search that enforces the proximity of "Mont Royal". It would be nice for "Mont-Royal" to retrieve "MontRoyal" while "Royal" would still retrieve the records indexed under "Royal".
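The behaviour wished for here amounts to analysing text differently at index time and at query time. The sketch below shows that idea with a toy inverted index in plain Python. It is not how Zebra works and not a proposed Koha configuration; the function names, the sample records, and the all-terms-must-match search are assumptions made purely for illustration.

```python
import re
from collections import defaultdict


def index_tokens(text):
    """Index-time analysis: emit the joined form plus the separate parts."""
    tokens = []
    for chunk in text.split():
        parts = [part for part in re.split(r"-+", chunk) if part]
        if len(parts) > 1:
            tokens.append("".join(parts).lower())
        tokens.extend(part.lower() for part in parts)
    return tokens


def query_tokens(text):
    """Query-time analysis: only drop hyphens, with no expansion."""
    joined = (chunk.replace("-", "").lower() for chunk in text.split())
    return [token for token in joined if token]


def build_index(records):
    """Map each index token to the set of record ids containing it."""
    index = defaultdict(set)
    for record_id, title in records.items():
        for token in index_tokens(title):
            index[token].add(record_id)
    return index


def search(index, query):
    """Return the ids of records that contain every query token."""
    results = None
    for token in query_tokens(query):
        hits = index.get(token, set())
        results = set(hits) if results is None else results & hits
    return results or set()


# Made-up sample records, for illustration only.
records = {
    1: "Parc du Mont-Royal",
    2: "Royal Mont Hotel",
    3: "Royal Society",
}
index = build_index(records)
print(search(index, "Mont-Royal"))
print(search(index, "Royal"))
```

With this split, "Mont-Royal" only matches the record that actually contains the hyphenated form, while "Royal" still matches every record containing that word. The catch, as noted earlier, is that it needs separate analysis for indexing and searching.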

I wonder what Lucene-based search engines like Solr and Elasticsearch do there...

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Australia

Office: 02 9212 0899
Direct: 02 8005 0595

-----Original Message-----

Date: Tue, 24 Sep 2019 21:55:07 +0200
From: Katrin Fischer <katrin.fischer.83@web.de>
To: koha@lists.katipo.co.nz
Subject: Re: [Koha] How to make the Koha/Zebra search ignore hyphens?
Message-ID: <e3aab35e-5260-7091-c37f-c0cdd16c41ba@web.de>
Content-Type: text/plain; charset=utf-8; format=flowed

Hi Michael,

we looked into this ages ago and it didn't seem possible to achieve both
- treating hyphen (-) as a space and not a space at the same time. Maybe
we missed something - If there is a solution, I'd be interested in a
how-to! :)

Katrin

dcook＠prosentient.com.au
