[Koha] Request for documentation -- ICU chains
Karam Qubsi
karamqubsi at gmail.com
Tue Oct 11 19:14:47 NZDT 2016
Hi Barton,
I did some research on this for Arabic searching, but I never reached a
stable version of the ICU files because of a lack of time and support for
the project.
Anyway, I published what I had done as a blog post in Arabic; someone
translated it into English, and it was published on the wiki page you
mentioned:
https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records
I'm not sure exactly what you are trying to do, but I don't think it will be
that complex to override the default ICU behavior.
On the page mentioned above I put an example ICU file for Arabic:
<icu_chain locale="ar">
<transliterate rule="\'>\ "/>
<transliterate rule="[:Number:] { '-' > * "/>
<transform rule="[:Control:] Any-Remove"/>
<tokenize rule="l"/>
<transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
<transform rule="NFD"/>
<transform rule="[:Nonspacing Mark:] Remove"/>
<transform rule="NFC"/>
<transliterate rule="{ الا > ا "/>
<transliterate rule="{ الأ > أ "/>
<transliterate rule="{ الإ > إ "/>
<transliterate rule="{ الآ > آ "/>
....
These rules replace every ال plus the following letter with that letter
alone. It is not perfect, because it can also change an "ال" in the middle
of a word, which is not what we want.
I believe we might be able to use some kind of regex to handle that, but I'm
not sure whether the ICU file accepts regular expressions; I didn't try. You
could try it, but depending on the cases you are trying to resolve, you
might not need it at all.
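On the mid-word problem, the ICU rule syntax described in the user guide has
a `^` anchor that matches the start of the text and a `}` marker that
separates the matched text from trailing context. The fragment below is an
untested sketch of how the two might be combined. It assumes that rules
placed after <tokenize> are applied to each token separately, so that `^`
means "start of the word"; that assumption about Zebra has not been checked
here.

```xml
<!-- Untested sketch: delete ال only at the start of a token, and only when
     a letter follows it. The letter after } is context and is kept. -->
<transliterate rule="^ ال } [:Letter:] > ;"/>
```

Words in which ال is part of the word itself rather than the article would
still lose it, so results should be checked against real records before
relying on a rule like this.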
So I think the best way to start is to understand how to write a
transliterate rule, which is documented on this page:
http://userguide.icu-project.org/transforms/general/rules
and then create rules for the language we want by adding them to the
words / phrases icu.xml files, in the following way:
<transliterate rule="{ thea > a "/>
<transliterate rule="{ thew > w "/>
I remember that when I tried adding a lot of rules, the Zebra reindexing
process became slower than before.
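One way to keep the number of rules down might be to let a single rule cover
every following letter with a character set, instead of writing one rule per
letter. This is only a sketch: it has not been benchmarked, so whether it
actually helps reindexing speed is an assumption, and the Arabic range used
(U+0621 to U+064A) is my choice for illustration.

```xml
<!-- One rule in place of thea > a ... thez > z: delete "the" whenever a
     letter a-z follows. The following letter is context and is kept. -->
<transliterate rule="the } [a-z] > ;"/>
<!-- Assumed Arabic analogue: delete ال before any letter in the basic
     Arabic range. -->
<transliterate rule="ال } [\u0621-\u064A] > ;"/>
```

Like the per-letter rules, these are not anchored, so they also match in the
middle of a word.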
I hope this helps.
Please take a look at this email from 2012:
http://koha.1045719.n5.nabble.com/Re-Koha-need-help-in-zebra-indexing-for-Arabic-words-td5730871.html#a5731004
---------- Forwarded message ----------
From: Karam Qubsi <karamqubsi at gmail.com>
Date: Fri, Oct 26, 2012 at 6:58 AM
Subject: Re: [Koha-devel] [Koha] need help in zebra indexing for Arabic
words
To: Paul Poulain <paul.poulain at biblibre.com>
Cc: koha-devel at lists.koha-community.org
Hi all,
I solved this in Zebra by customizing the transliterate rules in the
words-icu.xml file.
I will share a complete file that solves this for Arabic soon!
The solution is to add rules like the following. To keep the example simple
I will not use Arabic characters here.
Suppose we have a language X that is written with connected letters, where
some letters are not important for searching. Take the word " *the*word ":
the searcher is not interested in finding *the*; he is really searching for
"word".
I solved this by following this guide:
http://userguide.icu-project.org/transforms/general/rules#TOC-Context
and making Zebra convert thew to w.
We may have to do this for every letter: thea to a, theb to b, and so on up
to thez to z,
like in the following:
<transliterate rule="{ thea > a "/>
<transliterate rule="{ thew > w "/>
...
...
..
<transliterate rule="{ thez > z "/>
So if someone searches for theword, Zebra will convert thew to w, and
searching for word will match theword :D
And for Arabic:
<transliterate rule="{ الا > ا "/>
<transliterate rule="{ الب > ب "/>
.....
...
...
..
<transliterate rule="{ الي > ي "/>
So searching for "بحث"
will find "البحث",
and this will solve the whole problem :)
I hope this helps you, Mohamed.
Thank you Frédéric, Paul
Karam
On Tue, Oct 11, 2016 at 12:17 AM, Barton Chittenden <
barton at bywatersolutions.com> wrote:
> Karam,
>
> Thanks for pointing me to those links and adding them to the wiki.
>
> Koha community in general,
>
> I've refactored
>
> https://wiki.koha-community.org/wiki/ICU_chains_configuration
> https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records
> https://wiki.koha-community.org/wiki/Correcting_Search_of_Polish_records
>
> I copied the modifications to words-icu.xml into
> https://wiki.koha-community.org/wiki/ICU_Chains_Library
>
> If you have customized words-icu.xml or phrases-icu.xml for languages at
> your library, I would very much like to see sections added to that page --
> creating transliteration rules represents a lot of effort, and being able
> to spread that effort across the community, as well as having a complete
> library of transliterations would be a great resource for Koha.
>
> Thanks,
>
> --Barton
>
> On Mon, Oct 10, 2016 at 7:44 AM, Karam Qubsi <karamqubsi at gmail.com> wrote:
>
>> Hi ,
>> Have you checked this :
>> http://userguide.icu-project.org/transforms/general/rules
>>
>>
>>
>> On Mon, Oct 10, 2016 at 3:45 AM, Barton Chittenden <
>> barton at bywatersolutions.com> wrote:
>>
>>> I did a bit of searching for ICU chains documentation on the Koha wiki,
>>> based on Fatone's post 'local language searching problem' ... and I wasn't
>>> really satisfied with what I found:
>>>
>>> The most complete documentation that I found was here:
>>>
>>> https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records
>>>
>>> There's also
>>>
>>> https://wiki.koha-community.org/wiki/Correcting_Search_of_Polish_records
>>>
>>> I've added https://wiki.koha-community.org/wiki/ICU_chains_configuration
>>>
>>> but I'm not enough of an expert to explain how to set up transliterate
>>> rules, which seem to be an important part of getting ICU to work to its
>>> fullest potential.
>>>
>>> Thanks,
>>>
>>> --Barton
>>> _______________________________________________
>>> Koha mailing list http://koha-community.org
>>> Koha at lists.katipo.co.nz
>>> https://lists.katipo.co.nz/mailman/listinfo/koha
>>>
>>
>>
>>
>> --
>> *Karam Qubsi*
>>
>>
>
--
*Karam Qubsi*