[Koha] CKJ and Other Non-Roman Searching

Nicolas Legrand nicolas.legrand at bulac.fr
Fri Sep 11 19:45:10 NZST 2020


A good day, אהלן, こんにちは,

On Thu, 10 Sep 2020 at 03:16, Charles Kelley <cmkelleymls at gmail.com>
wrote:

> Hi, all!
>
>     My library has an extensive catalog in CJK, Russian, and a few other
> languages that write in non-Roman writing systems.
>

Oh, mine too :).


>
>     We can import such records into Koha and export the records from Koha.
> Provided we make sure the applications (BBEdit, EndNote, Excel, Word, to
> name a few), browsers (Chrome, Firefox, IE, Safari, etc.), OSs (Linux, Mac
> OS, and Windows) can handle UTF-8, all is well -- for importing and
> exporting. But getting Koha to search CJK has been fruitless, and we are
> terribly frustrated.
>

Been there.


>
>     How does one get Koha to search CJK (and Arabic, Cyrillic, Hebrew, and
> other non-Roman writing systems, for that matter)?
>
>     Koha 20.05 running on Debian 9.4 "Stretch".
>

Tuning Zebra to search CJK or languages written in Arabic script was a
real pain; with Elasticsearch it is a no-brainer. With the ICU module
enabled, it works very well for CJK and handles glyph similarities.
For example, the Library of Congress catalogues Farsi with alef maksura
(U+0649) instead of yeh (U+06CC). We imported Farsi records from the
Library of Congress and could not find the documents when searching from
a Farsi keyboard, which produces the letter yeh. You can configure Zebra to
handle this and declare U+0649 = U+06CC. With Elasticsearch and ICU you
don't have to, it just works. At some point we lost the ability to search
CJK with Zebra and never understood how to get it back. Zebra is certainly
a very good search engine, but it is quirky and hard to tune. With
Elasticsearch we don't even have to ask ourselves how to tweak it. It just
works.
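
To make the alef maksura / yeh problem concrete, here is a minimal Python
sketch of what "U+0649 = U+06CC" means for matching. It is only an
illustration: the names YEH_FOLD, fold and matches are hypothetical, and
this is neither Zebra's configuration syntax nor what ICU does internally.

```python
# Hypothetical folding table: treat alef maksura as Farsi yeh for search.
YEH_FOLD = str.maketrans({
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
})


def fold(text: str) -> str:
    """Return text with U+0649 replaced by U+06CC."""
    return text.translate(YEH_FOLD)


def matches(query: str, field: str) -> bool:
    """Naive substring match after folding both the query and the field."""
    return fold(query) in fold(field)


# The same word, ending in U+0649 (as catalogued) or U+06CC (as typed).
stem = "\u0641\u0627\u0631\u0633"
catalogued = stem + "\u0649"
typed = stem + "\u06CC"

print(typed in catalogued)         # raw comparison: the code points differ
print(matches(typed, catalogued))  # folded comparison: the strings agree
```

Without the folding step the two strings differ by one code point, which
is why a search typed on a Farsi keyboard missed the imported records.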

Note that for Chinese, enabling QueryAutoTruncate with Elasticsearch may
lead to odd results when you type a full Chinese name or title. This was
the case as of 18.05; I haven't checked whether it has improved since then.
We set it to truncate only when “*” is added at the end of a word.
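
The difference between the two truncation modes can be sketched as
follows. This is a rough illustration only: build_query is a hypothetical
helper, not Koha's query builder, and it assumes terms are split on
whitespace.

```python
def build_query(user_input: str, auto_truncate: bool) -> str:
    """Append a trailing "*" to every term when auto_truncate is on.

    With auto_truncate off, a term is truncated only if the user typed
    the "*" themselves.
    """
    terms = user_input.split()
    if auto_truncate:
        terms = [t if t.endswith("*") else t + "*" for t in terms]
    return " ".join(terms)


print(build_query("koha search", auto_truncate=True))
print(build_query("koha* search", auto_truncate=False))
# A Chinese title typed without spaces is a single term in this sketch.
print(build_query("\u7ea2\u697c\u68a6", auto_truncate=True))
```

In this sketch, a title typed in full still ends up as a wildcard query
when automatic truncation is on, even though an exact match was intended.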

Best regards, יאללה ביי, それでは、また,

-- 

*Nicolas Legrand*
Technical administration and development of the library management
system


Bibliothèque universitaire
des langues et civilisations

65 rue des Grands Moulins
F-75013 PARIS
T +33 1 81 69 *18 22*
www.bulac.fr

