[Koha] Request for documentation -- ICU chains

Tue Oct 11 19:14:47 NZDT 2016

Hi Barton,
I actually did some research on this for Arabic searching, but I didn't
arrive at a stable version of the ICU files because of a lack of time and
support for this project.

Anyway, I published what I had done as a blog post in Arabic; someone
translated it into English, and it was published on the wiki page you
mentioned:
https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records

I'm not sure exactly what you are trying to do, but I think it will not be
that complex to override the default behavior of ICU.

On the page mentioned above I put an example of an ICU file for Arabic:

 <icu_chain locale="ar">
   <transliterate rule="\'>\ "/>
   <transliterate rule="[:Number:] { '-' > * "/>*
   <transform rule="[:Control:] Any-Remove"/>
   <tokenize rule="l"/>
   <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
   <transform rule="NFD"/>
   <transform rule="[:Nonspacing Mark:] Remove"/>
   <transform rule="NFC"/>
   <transliterate rule="{ الا > ا "/>
   <transliterate rule="{ الأ > أ "/>
   <transliterate rule="{ الإ > إ "/>
   <transliterate rule="{ الآ > آ "/>

....

This will replace every ال in Arabic with the corresponding value (the
letter that follows it). It is not perfect, as it might also change some
"ال" in the middle of a word, which is not what we want.
I believe we might also be able to use a regex to handle that, but I'm not
sure whether the ICU file will accept regular expressions. I didn't try it,
but you might. It depends on the cases you are trying to resolve; you might
not need it at all.
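
To make the mid-word problem concrete, here is a small sketch in plain
Python. It only imitates the effect of the rules above and does not use ICU
or Zebra at all; the function names are made up for illustration. Whether
the same anchoring can be written in an ICU rule should be checked against
the rules guide.

```python
import re

def strip_anywhere(token, prefix):
    # Imitates an unanchored rule: drop the prefix wherever it is
    # directly followed by another letter, even in the middle of a word.
    return re.sub(re.escape(prefix) + r"(?=\w)", "", token)

def strip_at_start(token, prefix):
    # Same idea, but only when the prefix opens the token.
    return re.sub(r"^" + re.escape(prefix) + r"(?=\w)", "", token)

print(strip_anywhere("البحث", "ال"))   # wanted: leading article removed
print(strip_anywhere("خالد", "ال"))    # unwanted: the name loses two letters
print(strip_at_start("خالد", "ال"))    # anchored version leaves it alone
```

The second call shows the case described above: the "ال" inside the word is
removed as well, while the anchored variant only touches a leading "ال".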

So I think the best thing is to start by understanding how to write a
transliterate rule, which is documented on this page:
http://userguide.icu-project.org/transforms/general/rules

Then we can create rules for the language we want by adding them to the
words-icu.xml / phrases-icu.xml files, in the following way:
  <transliterate rule="{ thea > a "/>
  <transliterate rule="{ thew > w "/>

I remember that when I tried adding a lot of rules, the Zebra reindexing
process became slower than before.

I hope that helps.

Please take a look at this email from 2012:
http://koha.1045719.n5.nabble.com/Re-Koha-need-help-in-zebra-indexing-for-Arabic-words-td5730871.html#a5731004

---------- Forwarded message ----------
From: Karam Qubsi <karamqubsi at gmail.com>
Date: Fri, Oct 26, 2012 at 6:58 AM
Subject: Re: [Koha-devel] [Koha] need help in zebra indexing for Arabic
words
To: Paul Poulain <paul.poulain at biblibre.com>
Cc: koha-devel at lists.koha-community.org

Hi all,
I solved this in Zebra by customizing the transliterate rules in the
words-icu.xml file.

I will share a complete file that solves this for Arabic soon!

The solution is to add rules like the following. As an example, I will not
use Arabic characters here, to keep it simple:

Suppose we have a language X that is written in connected letters, where
some letters are not important for searching. Take the word
" *the*word ": the searcher is not interested in finding *the*; he is
really searching for "word".

I solved this by following this guide:
http://userguide.icu-project.org/transforms/general/rules#TOC-Context

and making Zebra convert thew to w.
We may have to do this for every letter: thea to a, theb to b, and so on up
to thez to z,

as in the following:
  <transliterate rule="{ thea > a "/>
  <transliterate rule="{ thew > w "/>
...
...
..
  <transliterate rule="{ thez > z "/>
So if someone searches for theword, Zebra will convert thew to w, so
searching for word = theword :D
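
Since one rule is needed per letter, the lines can be generated instead of
typed by hand. This is only a sketch: make_rules is a made-up helper, and
it simply prints lines in the same format as the rules above, to be pasted
into words-icu.xml by hand.

```python
import string

def make_rules(prefix, letters):
    # One <transliterate> line per letter, in the same format as above:
    # the prefix plus the letter is rewritten to the letter alone.
    return ['<transliterate rule="{ %s%s > %s "/>' % (prefix, c, c)
            for c in letters]

rules = make_rules("the", string.ascii_lowercase)
print("\n".join("  " + r for r in rules))
```

The same helper can be called with "ال" and a string of Arabic letters to
produce the Arabic rules.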

And for Arabic:
  <transliterate rule="{ الا > ا "/>
  <transliterate rule="{ الب > ب "/>
.....
...
...
..

  <transliterate rule="{ الي > ي "/>
So searching for "بحث"
will find "البحث",

and this will solve the whole problem :)
I hope this helps you, Mohamed.

Thank you Frédéric, Paul

Karam

On Tue, Oct 11, 2016 at 12:17 AM, Barton Chittenden <
barton at bywatersolutions.com> wrote:

> Karam,
>
> Thanks for pointing me to those links and adding them to the wiki.
>
> Koha community in general,
>
> I've refactored
>
> https://wiki.koha-community.org/wiki/ICU_chains_configuration
> https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records
> https://wiki.koha-community.org/wiki/Correcting_Search_of_Polish_records
>
> I copied the modifications to words-icu.xml into
> https://wiki.koha-community.org/wiki/ICU_Chains_Library
>
> If you have customized words-icu.xml or phrases-icu.xml for languages at
> your library, I would very much like to see sections added to that page --
> creating transliteration rules represents a lot of effort, and being able
> to spread that effort across the community, as well as having a complete
> library of transliterations would be a great resource for Koha.
>
> Thanks,
>
> --Barton
>
> On Mon, Oct 10, 2016 at 7:44 AM, Karam Qubsi <karamqubsi at gmail.com> wrote:
>
>> Hi ,
>> Have you checked this :
>> http://userguide.icu-project.org/transforms/general/rules
>>
>>
>>
>> On Mon, Oct 10, 2016 at 3:45 AM, Barton Chittenden <
>> barton at bywatersolutions.com> wrote:
>>
>>> I did a bit of searching for ICU chains documentation on the Koha wiki,
>>> based on Fatone's post 'local language searching problem' ... and I
>>> wasn't really satisfied with what I found:
>>>
>>> The most complete documentation that I found was here:
>>>
>>> https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records
>>>
>>> There's also
>>>
>>> https://wiki.koha-community.org/wiki/Correcting_Search_of_Polish_records
>>>
>>> I've added https://wiki.koha-community.org/wiki/ICU_chains_configuration
>>>
>>> but I'm not enough of an expert to explain how to set up transliterate
>>> rules, which seem to be an important part of getting ICU to work to its
>>> fullest potential.
>>>
>>> Thanks,
>>>
>>> --Barton
>>> _______________________________________________
>>> Koha mailing list  http://koha-community.org
>>> Koha at lists.katipo.co.nz
>>> https://lists.katipo.co.nz/mailman/listinfo/koha
>>>
>>
>>
>>
>> --
>> *Karam Qubsi*
>>
>>
>

-- 
*Karam Qubsi*