Dear All, I recently imported our stock of 70,000 records into Koha via the Koha interface. Now I am wondering whether I am for some reason unable to search for Chinese (UTF-8 encoded) entries, which seemed to work fine when there were only two of them around, or whether Zebra is still busy indexing the new records. Is there a way of telling whether Zebra is still indexing? (It took about two days until we could search for the western entries from A-Z.) Appreciating any hints. Best regards, Marc
Hi, On Mon, Oct 6, 2008 at 8:24 AM, Marc Nürnberger <marc.nuernberger@gmx.de> wrote:
Is there a way of telling whether Zebra is still indexing? (It took about two days until we could search for the western entries from A-Z.)
If you're using the Zebra queue daemon, you can check how many records are left to index using this SQL query: select count(*) from zebraqueue where done = 0; To index everything in one fell swoop, I suggest running rebuild_zebra.pl -a -b -z; for 70,000 records, it should take less than an hour. Regards, Galen -- Galen Charlton VP, Research & Development, LibLime galen.charlton@liblime.com p: 1-888-564-2457 x709 skype: gmcharlt
Hi Galen, Thanks for the help with the rebuild_zebra script. It seems that Zebra had already done its job. Even reindexing the whole stock didn't change the fact that the Chinese parts of the records are displayed fine, but remain effectively unsearchable. Whenever I search for a Chinese term, I get the same set of results depending only on how many characters the search term contains, e.g. searching with any single character gives back result set no. 1, searching with any two-character search term gives back result set no. 2, and so on. I can even search with Japanese or Korean search terms and get the same Chinese result sets, again depending only on the length of the search term. But I don't know what this is supposed to tell me. Best regards, Marc Galen Charlton wrote:
Hi,
On Mon, Oct 6, 2008 at 8:24 AM, Marc Nürnberger <marc.nuernberger@gmx.de> wrote:
Is there a way of telling whether Zebra is still indexing? (It took about two days until we could search for the western entries from A-Z.)
If you're using the Zebra queue daemon, you can check how many records are left to index using this SQL query:
select count(*) from zebraqueue where done = 0;
To index everything in one fell swoop, I suggest using rebuild_zebra.pl -a -b -z; for 70,000, it should take less than an hour.
Regards,
Galen
-- Dr. Marc Nürnberger Ludwig-Maximilians-Universität München Institut für Sinologie Kaulbachstr. 51a | D-80539 München | Deutschland Tel: +49 89 2180 3632 Fax: +49 89 342666 nuernberger@lmu.de | www.sinologie.lmu.de
Marc Nürnberger wrote:
Hi Galen,
Thanks for the help with the rebuild_zebra script. It seems that Zebra had already done its job.
Even reindexing the whole stock didn't change the fact that the Chinese parts of the records are displayed fine, but remain effectively unsearchable. Whenever I search for a Chinese term, I get the same set of results depending only on how many characters the search term contains, e.g. searching with any single character gives back result set no. 1, searching with any two-character search term gives back result set no. 2, and so on. I can even search with Japanese or Korean search terms and get the same Chinese result sets, again depending only on the length of the search term. But I don't know what this is supposed to tell me.
Best regards, Marc
afaict, you are not using ICU, nor are you using YAZ with ICU. To be able to search UTF-8 correctly with YAZ and Zebra, you should build YAZ with ICU support, and use the files I am sending along to configure the ICU chains. Hope that helps. -- Henri-Damien LAURENT
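For readers who don't have the attachments: a YAZ/Zebra ICU chain is defined in a small XML file, and each index register in default.idx is then pointed at it with an icuchain directive instead of a charmap. The following is only an illustrative sketch written from memory of the yaz-icu documentation, not the actual files sent with this message; the rule strings and element order need to be taken from a known-good configuration:

```xml
<!-- Illustrative ICU chain sketch; NOT the file attached to this message. -->
<icu_chain locale="en">
  <!-- strip control characters from the input -->
  <transform rule="[:Control:] Any-Remove"/>
  <!-- split into word tokens; this is what lets CJK text be indexed
       term-by-term instead of as one opaque string -->
  <tokenize rule="w"/>
  <!-- drop whitespace and punctuation from each token -->
  <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
  <!-- lowercase the token -->
  <casemap rule="l"/>
</icu_chain>
```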
Henri-Damien LAURENT wrote:
afaict, you are not using ICU, nor are you using YAZ with ICU. To be able to search UTF-8 correctly with YAZ and Zebra, you should build YAZ with ICU support, and use the files I am sending along to configure the ICU chains. Hope that helps.
I am in the process of setting up Koha for a very small community library (one to two thousand books). Entering the book data into Koha has been started but is a very slow process due to limited volunteers. So, how do I confirm that ICU has been installed on my setup and that search is going to support UTF-8? Currently, only a few books are in the catalog and all of them are in English. Search works on those, but I am not sure if my installation is UTF-8 compliant.
Hello HS, i am not using Zebra, but most of the advice from my experience below should work regardless. Hopefully, you followed the install instructions that help ensure the system is set up to use UTF-8 (these appear in more than one of the install guides on kohadocs and elsewhere; for example, see INSTALL.fedora7 in the root folder when you unzip koha-3.00.00-stableRC1). i am quoting only a *snippet* of the install instructions below:
QUOTE
To check, open a terminal window and run the locale command. You must obtain: LANG=en_GB.UTF-8
....
1.4.2 Is the Apache 2 web server configured to use UNICODE?
To verify, using a text editor, open the httpd.conf file located in /etc/httpd/conf and look whether it contains the directive:
AddDefaultCharset UTF-8
1.4.3 Is the MySQL server configured to use UNICODE?
NB: the mysqld service must be started. In a terminal window, type the command mysql, then an SQL query to display the server's configuration variables. Text of the query:
SHOW VARIABLES LIKE 'char%';
UNQUOTE
Once you have done so, it's time to test whether Koha is indeed set up to use UTF-8 well. Choose to search for and obtain MARC records for books in language scripts that need UTF-8. Thanks to the internet, it is possible to do so even without support for keyboard layouts in that language being installed on the system you are using to test Koha. For example:
1) I shall search for a book in the Hindi language (which uses the Devanagari script) at the online catalog of Delhi Public Library (http://59.176.17.111/cgi-bin/koha/opac-main.pl)
2) To obtain suitable text to search for, i use the Google Indic transliteration online tool (http://www.google.co.in/transliterate/indic). I enter the word "Kahani" and the tool promptly gives me कहानी in the Hindi script. "Kahani" is the Hindi word for "story", so i guess you should find enough books with this.
3) Sure enough, the OPAC search returns a number of books.
4) i then select a book and choose to download the MARC record for the book, taking care to use a UTF-8 format.
5) Import the MARC record into your catalog, and later try searching for it the same way as you performed the search on the Delhi Public Library website.
If this works, i guess you are good to try anything that needs UTF-8. Remember that you need to be careful when transmitting/editing UTF-8 content, so that at no point during the entire process you introduce or try to save data that is not in UTF-8 encoding. Thanks and regards, krishnan mani Pune, India
--- On Tue, 7/10/08, H.S. <hs.samix@gmail.com> wrote: Subject: Re: [Koha] Searching UTF-8 entries [Koha 3.01.00.002] Henri-Damien LAURENT wrote:
afaict, you are not using ICU, nor are you using YAZ with ICU. To be able to search UTF-8 correctly with YAZ and Zebra, you should build YAZ with ICU support, and use the files I am sending along to configure the ICU chains. Hope that helps.
I am in the process of setting up Koha for a very small community library (one to two thousand books). Entering the book data into Koha has been started but is a very slow process due to limited volunteers. So, how do I confirm that ICU has been installed on my setup and that search is going to support UTF-8? Currently, only a few books are in the catalog and all of them are in English. Search works on those, but I am not sure if my installation is UTF-8 compliant. _______________________________________________ Koha mailing list Koha@lists.katipo.co.nz http://lists.katipo.co.nz/mailman/listinfo/koha
Krishnan Mani wrote:
Hello HS,
i am not using Zebra, but most of the stuff from my experience below will work regardless:
Hopefully, you followed the install instructions that help ensure the system is set up to use UTF-8 (these appear in more than one of the install guides on kohadocs and elsewhere; for example, see INSTALL.fedora7 in the root folder when you unzip koha-3.00.00-stableRC1).
i am quoting only a *snippet* of the install instructions below: QUOTE To check, open a terminal window and run the locale command. You must obtain: LANG=en_GB.UTF-8
Check. I have:
$> locale
LANG=en_CA.UTF-8
LANGUAGE=en_CA:en
LC_CTYPE="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME=en_DK.UTF-8
LC_COLLATE="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_PAPER="en_CA.UTF-8"
LC_NAME="en_CA.UTF-8"
LC_ADDRESS="en_CA.UTF-8"
LC_TELEPHONE="en_CA.UTF-8"
LC_MEASUREMENT="en_CA.UTF-8"
LC_IDENTIFICATION="en_CA.UTF-8"
LC_ALL=
.... 1.4.2 Is the Apache 2 web server configured to use UNICODE?
To verify, using a text editor, open the httpd.conf file located in /etc/httpd/conf and look whether it contains the directive:
AddDefaultCharset UTF-8
Check. I have:
$> cat /etc/apache2/conf.d/charset
# <SNIP>
AddCharset UTF-8 .utf8
AddDefaultCharset UTF-8
1.4.3 Is the MySQL server configured to use UNICODE?
NB: the mysqld service must be started. In a terminal window, type the command mysql, then an SQL query to display the server's configuration variables. Text of the query:
SHOW VARIABLES LIKE 'char%'; UNQUOTE
This I am not sure about. I am not getting any conf variable with utf in it:
$> pwd
/etc/mysql
$> sudo grep -r utf *
$>
So I think I need to fix this before proceeding further. Thanks.
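For what it's worth, the usual way to make MySQL default to UTF-8 is in the server's option file. This is only a sketch, not something quoted from the Koha guides: the file location and section layout vary by distribution (on Debian/Ubuntu it is typically /etc/mysql/my.cnf), and the variable names are the standard MySQL ones:

```ini
; Sketch: make UTF-8 the default for the server and for command-line clients.
[client]
default-character-set = utf8

[mysqld]
character-set-server = utf8
collation-server     = utf8_unicode_ci
```

After restarting mysqld, re-running SHOW VARIABLES LIKE 'char%'; should report utf8 for the client, connection, database, results and server entries. Note this only changes defaults for new databases and sessions; existing latin1 tables keep their encoding.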
On Mon, Oct 6, 2008 at 9:27 PM, Krishnan Mani <krishnanm75@yahoo.com> wrote:
1.4.3 Is the MySQL server configured to use UNICODE?
NB: the mysqld service must be started. In a terminal window, type the command mysql, then an SQL query to display the server's configuration variables. Text of the query:
SHOW VARIABLES LIKE 'char%'; UNQUOTE
In mysql that i have, I get this:
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | latin1 |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
7 rows in set (0.00 sec)
So even though my system is utf8, mysql databases are being created in latin1. Is my understanding correct? Thanks.
H. S. wrote:
In mysql that i have, I get this:
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | latin1 |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
7 rows in set (0.00 sec)
So even though my system is utf8, mysql databases are being created in latin1. Is my understanding correct?
Just found out that my koha database is in latin1 encoding:
mysql> SHOW CREATE DATABASE koha;
+----------+-----------------------------------------------------------------+
| Database | Create Database                                                 |
+----------+-----------------------------------------------------------------+
| koha     | CREATE DATABASE `koha` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+
1 row in set (0.00 sec)
mysql>
Now the next big thing is to convert the database to utf8. Now, I have some text in a different language (for example the name of the library). It is displayed properly in the browser. Isn't that text saved in one of the tables? And what will happen to that if I convert all the tables to utf8 now? Thanks.
Hi HS, Yes, you are definitely on your way now from your last few posts. What you see as the output from your mysql client are the Defaults for character_set_client, etc. However, a MySQL client can "interactively negotiate" the choice of encoding with the MySQL server. That may be the reason why some of your text still displays right, but you need to ensure you are indeed using language data that needs UTF-8 (any 'double-byte' characters are a good bet, like Chinese characters). My guess is that some of the text that is in a different language may not need UTF-8 to display correctly if Latin1 encoding is adequate to represent it. This should certainly be the case with some European languages (though i haven't verified this as i write). i hope that you are now following one of the good install guides. Remember that i only pasted an EXCERPT from them. I haven't tried switching the encoding on an *existing* Koha (or any other) database. My guess is that it may not be an "automatic" operation where you can expect everything to work just as before without doing anything with the data. i have never even "migrated" from one Koha database or version to another, so i am out of my depth here. Since it looks like you didn't catalog a whole lot, the best bet may be to start with a fresh Koha database. Perhaps you may be able to export your holdings and import them afresh. An important disclaimer i forgot to add to my earlier post: your choice of browser or e-mail client and configuration may even prevent you from correctly viewing the Hindi text i pasted into the e-mail. The correct approach is to have an image of the text next to the text itself and see how they display in your browser or client. There are some pages on Wikipedia that will do this for you and help you check whether your browser is correctly set up (i don't recall which ones at the moment). Thanks and regards, krishnan mani Pune, India --- On Tue, 7/10/08, H.S. <hs.samix@gmail.com> wrote:
H. S. wrote:
In mysql that i have, I get this:
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | latin1 |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
7 rows in set (0.00 sec)
So even though my system is utf8, mysql databases are being created in latin1. Is my understanding correct?
Just found out that my koha database is in latin1 encoding:
mysql> SHOW CREATE DATABASE koha;
+----------+-----------------------------------------------------------------+
| Database | Create Database                                                 |
+----------+-----------------------------------------------------------------+
| koha     | CREATE DATABASE `koha` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+
1 row in set (0.00 sec)
mysql>
Now the next big thing is to convert the database to utf8. Now, I have some text in a different language (for example the name of the library). It is displayed properly in the browser. Isn't that text saved in one of the tables? And what will happen to that if I convert all the tables to utf8 now? Thanks.
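Krishnan's point that the client and server negotiate an encoding, and that Latin-1 may silently cope with some European text while mangling anything else, can be demonstrated outside MySQL. The following standalone Python sketch (nothing Koha-specific; the Hindi word is the one from the earlier OPAC test) shows the two failure modes seen in this thread:

```python
# Standalone demonstration (not Koha-specific) of two encoding failure modes.
text = "कहानी"  # the Hindi word used in the earlier OPAC test search

# Failure mode 1: the server hands back UTF-8 bytes, but the client session
# decodes them as Latin-1 -- every byte becomes a spurious character (mojibake).
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(repr(mojibake))  # 15 junk characters instead of 5 Devanagari ones

# Failure mode 2: text is squeezed through a Latin-1 channel that cannot
# represent it -- each character is replaced, which matches the '??????'
# effect seen in an OPAC header after a bad conversion.
questions = text.encode("latin-1", errors="replace").decode("latin-1")
print(questions)  # '?????'
```

In other words, text that *displays* fine is no proof the pipeline is UTF-8 clean; it may simply have fit into Latin-1.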
H.S. wrote:
H. S. wrote:
In mysql that i have, I get this:
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | latin1 |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
7 rows in set (0.00 sec)
So even though my system is utf8, mysql databases are being created in latin1. Is my understanding correct?
Just found out that my koha database is in latin1 encoding: mysql> SHOW CREATE DATABASE koha; +----------+-----------------------------------------------------------------+ | Database | Create Database | +----------+-----------------------------------------------------------------+ | koha | CREATE DATABASE `koha` /*!40100 DEFAULT CHARACTER SET latin1 */ | +----------+-----------------------------------------------------------------+ 1 row in set (0.00 sec) mysql>
So here is what I did. I made a second installation of Koha for testing purposes (on different ports, obviously) so that I could keep this test installation separate from the original installation. The test installation has a mysql database named koha_test and its associated mysql user and password.
Next, I dumped my original koha database (which I know is in latin1, see above):
$> mysqldump -uroot -p --opt --default-character-set=latin1 --skip-set-charset koha > 20081006-koha-latin1.sql
Next, I restored the dump to the test installation:
$> mysql -uroot -pgsqdb514 koha_test < 20081006-koha-latin1.sql
And tried importing a book with ISBN 8173802130 from a Z39.50 search. Since this test installation was a copy of the original, the result was as before. The title is shown in the OPAC as in this picture: http://picasaweb.google.ca/hs.samix/Screenshots#5253043585127255666
Next, I tried to remake the test db supporting utf8. I converted my original koha db dump to utf8:
$> iconv -f ISO-8859-1 -t UTF-8 20081006-koha-latin1.sql > 20081006-koha-utf8.sql
$> perl -pi -w -e 's/CHARSET=latin1/CHARSET=utf8/g;' 20081006-koha-utf8.sql
Then I dropped and recreated the test db supporting utf8:
$> mysql --user=root -p --execute="DROP DATABASE koha_test; CREATE DATABASE koha_test CHARACTER SET utf8 COLLATE utf8_unicode_ci;"
Enter password:
$>
And restored the utf8 db data:
$> mysql -uroot -p koha_test < 20081006-koha-utf8.sql
Enter password:
$>
At this point, the name of the library (the header in the OPAC page), which was not in English, appears as a series of question marks: ??????....
So, something has not gone right. I then again searched and imported the book with ISBN 8173802130 and added this item. Then I searched for it in the OPAC and got the title in the same appearance as shown in the image linked earlier in this post.
Can somebody make out what I have missed above in creating this koha_test db which supports utf8? Thanks.
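One guess at what may have gone wrong in the recipe above: if the latin1 columns already contained valid UTF-8 byte sequences (which happens when a UTF-8 application writes through a latin1 connection), then passing the dump through iconv -f ISO-8859-1 -t UTF-8 encodes those bytes a second time. Whether that applies here depends on how the data was originally entered; this Python sketch only illustrates the mechanism:

```python
# Sketch of the double-encoding trap. Assumption: the latin1 column already
# held UTF-8 bytes (i.e. the data only *looked* like Latin-1 to MySQL).
stored = "कहानी".encode("utf-8")  # raw bytes sitting in the latin1 column

# What iconv -f ISO-8859-1 -t UTF-8 does to those bytes: treat each byte as a
# Latin-1 character and re-encode it, producing doubly-encoded garbage.
double_encoded = stored.decode("latin-1").encode("utf-8")

print(len(stored), len(double_encoded))  # the data grows and no longer matches
assert double_encoded != stored
```

If that is what happened, the fix is usually to restore the latin1-flavoured dump into utf8 tables *without* the iconv step, since the bytes are already UTF-8.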
The best reference I've found for a confirmed UTF-8 setup is referenced by the main INSTALL document: http://wiki.koha.org/doku.php?id=encodingscratchpad HS, you need to consider that there are encodings set for the Database, the Tables, AND the Session. So even if you successfully loaded the tables and DB as UTF8, if your session is latin1, then you won't be able to view all data in the DB. Make sure you check the "SHOW VARIABLES..." output as the user that will be connecting when Apache is accessed. --joe On Tue, Oct 7, 2008 at 11:36 AM, H.S. <hs.samix@gmail.com> wrote:
H.S. wrote:
H. S. wrote:
In mysql that i have, I get this:
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | latin1 |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
7 rows in set (0.00 sec)
So even though my system is utf8, mysql databases are being created in latin1. Is my understanding correct?
Just found out that my koha database is in latin1 encoding: mysql> SHOW CREATE DATABASE koha;
+----------+-----------------------------------------------------------------+
| Database | Create Database |
+----------+-----------------------------------------------------------------+
| koha | CREATE DATABASE `koha` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+
1 row in set (0.00 sec) mysql>
So here is what I did. I made a second installation of Koha for testing purposes (on different ports, obviously) so that I could keep this test installation separate from the original installation. The test installation has a mysql database named koha_test and its associated mysql user and password.
Next, I dumped my original koha database (which I know is in latin1, see above): $> mysqldump -uroot -p --opt --default-character-set=latin1 --skip-set-charset koha > 20081006-koha-latin1.sql
Next, I restored the dump to the test installation: $> mysql -uroot -pgsqdb514 koha_test < 20081006-koha-latin1.sql
And tried importing a book with ISBN 8173802130 from a Z39.50 search. Since this test installation was a copy of the original, the result was as before. The title is shown in Opac as is shown in this picture: http://picasaweb.google.ca/hs.samix/Screenshots#5253043585127255666
Next, I tried to remake the test db supporting utf8. I converted my original koha db dump to utf8.
$> iconv -f ISO-8859-1 -t UTF-8 20081006-koha-latin1.sql > 20081006-koha-utf8.sql
$> perl -pi -w -e 's/CHARSET=latin1/CHARSET=utf8/g;' 20081006-koha-utf8.sql
Then I dropped and recreated the test db supporting utf8: $> mysql --user=root -p --execute="DROP DATABASE koha_test; CREATE DATABASE koha_test CHARACTER SET utf8 COLLATE utf8_unicode_ci;" Enter password: $>
And restored the utf8 db data: $> mysql -uroot -p koha_test < 20081006-koha-utf8.sql Enter password: $>
At this point, the name of the library (the header in the OPAC page), which was not in English, appears as a series of question marks: ??????....
So, something has not gone right.
I then again searched and imported the book with ISBN 8173802130 and added this item. Then searched for it in opac and get the title in the same appearance as shown in the image linked earlier in this post.
Can somebody make out what I have missed above in creating this koha_test db which supports utf8?
Thanks.
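Coming back to Joe's point above about the Session encoding: it can be inspected and changed from the mysql client directly. SET NAMES is standard MySQL and affects only the current session (character_set_client, _connection and _results); this is a diagnostic sketch, and the permanent fix belongs in the server or application configuration rather than an interactive statement:

```sql
-- Run this connected as the same MySQL user Koha's Apache processes use.
SHOW VARIABLES LIKE 'character\_set\_%';  -- this session's view of the encodings

SET NAMES utf8;                           -- client, connection and results -> utf8
SHOW VARIABLES LIKE 'character\_set\_%';  -- verify the change took effect
```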
Hi Henri-Damien, Thank you for your help. Now, I reinstalled YAZ with the --with-icu option, exchanged the default.idx with the one you sent me, and placed the attached icu.xml in the same directory. After reindexing I could not find any record at all. Is there anything else to do? Best regards, Marc Henri-Damien LAURENT wrote:
Marc Nürnberger wrote:
Hi Galen,
Thanks for the help with the rebuild_zebra script. It seems that Zebra had already done its job.
Even reindexing the whole stock didn't change the fact that the Chinese parts of the records are displayed fine, but remain effectively unsearchable. Whenever I search for a Chinese term, I get the same set of results depending only on how many characters the search term contains, e.g. searching with any single character gives back result set no. 1, searching with any two-character search term gives back result set no. 2, and so on. I can even search with Japanese or Korean search terms and get the same Chinese result sets, again depending only on the length of the search term. But I don't know what this is supposed to tell me.
Best regards, Marc
afaict, you are not using ICU, nor are you using YAZ with ICU. To be able to search UTF-8 correctly with YAZ and Zebra, you should build YAZ with ICU support, and use the files I am sending along to configure the ICU chains. Hope that helps.
participants (7)
- Galen Charlton
- H. S.
- H.S.
- Henri-Damien LAURENT
- Joe Atzberger
- Krishnan Mani
- Marc Nürnberger