Request for documentation -- ICU chains
I did a bit of searching for ICU chains documentation on the Koha wiki, based on Fatone's post 'local language searching problem' ... and I wasn't really satisfied with what I found:
The most complete documentation that I found was here:
https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records
There's also
https://wiki.koha-community.org/wiki/Correcting_Search_of_Polish_records
I've added https://wiki.koha-community.org/wiki/ICU_chains_configuration
but I'm not enough of an expert to explain how to set up transliterate rules, which seem to be an important part of getting ICU to work to its fullest potential.
Thanks,
--Barton
Hi,
Have you checked this: http://userguide.icu-project.org/transforms/general/rules
_______________________________________________
Koha mailing list http://koha-community.org
Koha@lists.katipo.co.nz
https://lists.katipo.co.nz/mailman/listinfo/koha
-- *Karam Qubsi*
Karam,
Thanks for pointing me to those links and adding them to the wiki.
Koha community in general,
I've refactored
https://wiki.koha-community.org/wiki/ICU_chains_configuration
https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records
https://wiki.koha-community.org/wiki/Correcting_Search_of_Polish_records
I copied the modifications to words-icu.xml into https://wiki.koha-community.org/wiki/ICU_Chains_Library
If you have customized words-icu.xml or phrases-icu.xml for languages at your library, I would very much like to see sections added to that page -- creating transliteration rules represents a lot of effort, and being able to spread that effort across the community, as well as having a complete library of transliterations, would be a great resource for Koha.
Thanks,
--Barton
Hi Barton,
I did some research on this for Arabic searching, but I never reached a stable version of the ICU files because of a lack of time and support for the project. I published what I had done in a blog post in Arabic; someone translated it into English, and it was published on the wiki page you mentioned:
https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records
I'm not sure exactly what you are trying to do, but I don't think it will be that complex to override the default behavior of ICU. On the page mentioned above I put an example ICU file for Arabic:
<icu_chain locale="ar">
  <transliterate rule="\'>\ "/>
  <transliterate rule="[:Number:] { '-' > * "/>*
  <transform rule="[:Control:] Any-Remove"/>
  <tokenize rule="l"/>
  <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
  <transform rule="NFD"/>
  <transform rule="[:Nonspacing Mark:] Remove"/>
  <transform rule="NFC"/>
  <transliterate rule="{ الا > ا "/>
  <transliterate rule="{ الأ > أ "/>
  <transliterate rule="{ الإ > إ "/>
  <transliterate rule="{ الآ > آ "/>
  ....
This replaces every ال in Arabic with the corresponding value. It is not perfect, because it might also change an "ال" in the middle of a word, which is not what we want. I believe we might be able to use some regex to handle that, but I'm not sure whether the ICU file accepts regular expressions. I haven't tried it; you might, depending on the cases you are trying to resolve, and you might not need it at all.
So I think the best thing is to start by understanding how to write a transliterate rule, which is documented on this page:
http://userguide.icu-project.org/transforms/general/rules
and then create the rules for the language we want by adding them to the words-icu.xml / phrases-icu.xml files, in the following way:
<transliterate rule="{ thea > a "/>
<transliterate rule="{ thew > w "/>
I remember that when I added a lot of rules, the Zebra reindexing process became slower than before.
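The caveat above (an unanchored rule also rewrites the prefix in the middle of a word) can be illustrated outside ICU. What follows is a minimal sketch in plain Python using the `re` module, not an ICU or Zebra rule; the function names are hypothetical, and whether the ICU chain file accepts an equivalent word-start condition is not confirmed here.

```python
import re


def strip_anywhere(text: str, prefix: str) -> str:
    """Mimic the unanchored rules: drop prefix wherever a word character follows."""
    return re.sub(re.escape(prefix) + r"(?=\w)", "", text)


def strip_word_initial(text: str, prefix: str) -> str:
    """Drop prefix only at a word boundary, when a word character follows."""
    return re.sub(r"\b" + re.escape(prefix) + r"(?=\w)", "", text)


# Latin stand-in from the thread: "the" glued to the front of a word.
print(strip_anywhere("theword bathed", "the"))      # word bad
print(strip_word_initial("theword bathed", "the"))  # word bathed

# Arabic: the article is removed from the first word only;
# the second word contains the same two letters mid-word and is left alone.
print(strip_word_initial("البحث عالم", "ال"))
```

The unanchored version turns "bathed" into "bad", which is the mid-word damage described above; the anchored version leaves it intact.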
I hope that helps. Please take a look at this email from 2012:
http://koha.1045719.n5.nabble.com/Re-Koha-need-help-in-zebra-indexing-for-Arabic-words-td5730871.html#a5731004
---------- Forwarded message ----------
From: Karam Qubsi <karamqubsi@gmail.com>
Date: Fri, Oct 26, 2012 at 6:58 AM
Subject: Re: [Koha-devel] [Koha] need help in zebra indexing for Arabic words
To: Paul Poulain <paul.poulain@biblibre.com>
Cc: koha-devel@lists.koha-community.org
Hi all,
I solved this in Zebra by customizing the transliterate rules in the words-icu.xml file. I will share a complete file that solves this for Arabic soon!
The solution is to add rules like the following. As an example (I will not use Arabic characters here, to keep it simple): suppose we have a language X that is written in connected letters, and some of those letters are not important for searching. So we have the word "*the*word". The searcher is not interested in finding *the*; he is really searching for "word". I solved this by following this guide:
http://userguide.icu-project.org/transforms/general/rules#TOC-Context
and making Zebra convert thew to w. We may have to do this for every letter: thea to a, theb to b, ... thez to z, as in the following:
<transliterate rule="{ thea > a "/>
<transliterate rule="{ thew > w "/>
... ... ..
<transliterate rule="{ thez > z "/>
So if someone searches for theword, Zebra will convert thew to w, so searching for word = theword :D
And for Arabic:
<transliterate rule="{ الا > ا "/>
<transliterate rule="{ الب > ب "/>
..... ... ... ..
<transliterate rule="{ الي > ي "/>
So searching for "بحث" will find "البحث", and this will solve the whole problem :)
I hope this helps you, Mohamed.
Thank you Frédéric, Paul
Karam
-- *Karam Qubsi*
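The forwarded 2012 message describes writing one rule per letter (thea to a, theb to b, ... thez to z). Rather than typing those lines by hand, they could be generated. The sketch below is a Python helper with a hypothetical name, `prefix_rules`; it only reproduces the rule format quoted in the thread and has not been tested against Zebra.

```python
import string
from typing import List


def prefix_rules(prefix: str, letters: str) -> List[str]:
    """Return one <transliterate> line per letter, in the format quoted in the thread."""
    return [
        f'<transliterate rule="{{ {prefix}{letter} > {letter} "/>'
        for letter in letters
    ]


# Latin stand-in: 26 rules, from "thea > a" through "thez > z".
for line in prefix_rules("the", string.ascii_lowercase)[:2]:
    print(line)
# <transliterate rule="{ thea > a "/>
# <transliterate rule="{ theb > b "/>
```

For Arabic, the same call would take "ال" as the prefix and a string of the letters to cover. Keep in mind the observation earlier in the thread that a large number of rules made Zebra reindexing slower.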
participants (2)
- Barton Chittenden
- Karam Qubsi