Koha 2.2.9, Unicode (UTF-8), Latin-1 (ISO-8859-1) and migration to Koha 3
Hi list,

I have a Koha 2.2.9 system running on a machine with SLES 9 (SUSE Linux Enterprise Server) with SP3 (Service Pack 3), running Apache 2.2.0 and MySQL 4.0.18. I'm now considering migrating it to another machine that will run Koha 3 Beta 2, on a server with SLES 10 with SP1 (Service Pack 1), running Apache 2.2.4 and MySQL 5.0.26.

I have done a mysqldump of the Koha database on the Koha 2.2.9 system. Unfortunately, I found out that the dump has mixed character encodings: some characters are in Unicode ("UTF-8") and others are in Latin-1 ("ISO 8859-1" and/or "ISO-8859-15"). I am Portuguese (living in Portugal), so the "problematic" characters are the Portuguese accented characters ("ã" - a tilde; "ç" - c cedilla; "é" - e acute; and other accented characters). This leads to my first question:

1 - Should a Koha 2.2.9 system preferably be set up for Unicode ("UTF-8") or Latin-1 ("ISO-8859-1" / "ISO 8859-15")?

By reading the following page:

Encoding and Character Sets in Koha
http://wiki.koha.org/doku.php?id=encodingscratchpad

... and namely the first version of that page - http://wiki.koha.org/doku.php?id=encodingscratchpad&rev=1152103445 - it seems that for versions of Koha >= 2.2.6, I should set up the "locale", Apache and MySQL for Unicode ("UTF-8"). Is this correct?

My next question is this one:

2 - What is the "best" way to convert this "mixed" mysqldump (UTF-8 / ISO-8859-1) file to a "pure" UTF-8 one (or to a "pure" Latin-1 one)?

I have already found these pages, but would appreciate feedback from fellow Koha users who have already had this problem:

How to sanitize a string with mixed encodings - UTF-8 and Latin1
http://www.fischerlaender.net/php/sanitize-utf8-latin1

Encoding issues MySql Latin / UTF-8
http://www.vlugge.eu/blog/algemeen/encoding-issues-mysql-latin-utf-8/

Mixed ISO-8859/UTF-8 conversion
http://www.perlmonks.org/?node_id=642617

And now, the "main question":

3 - Is the following sequence the best migration strategy?

3.1 - "Transform" the mysqldump to a "pure" UTF-8 file
3.2 - Install Koha 2.2.9 on the "second" machine (running SLES 10)
3.3 - Import the mysqldump on the "second" machine
3.4 - Install Koha 3 Beta 2 on the "second" machine
3.5 - Follow the steps described at:

Upgrading from Koha 2.2 to Koha 3.0
http://wiki.koha.org/doku.php?id=22_to_30

... or is there an easier / better way to do this?

Thanks for taking the time to read this! ANY help / information / feedback would be much appreciated! :)

Best wishes,
Ricardo Dias Marques
lists AT ricmarques DOT net
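Before choosing either conversion target, it helps to measure how mixed the dump actually is. A minimal sketch of that survey, in Python rather than the Perl the thread later settles on, and with the file path as a placeholder:

```python
# Sketch: classify each line of a mysqldump as ASCII, UTF-8, or
# Latin-1-only, to gauge how mixed the file really is.
# The path "koha.sql" is a placeholder for the actual dump file.

def classify(raw: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'latin-1' for a chunk of bytes."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Latin-1 assigns a character to every byte value, so this
        # branch is the fallback for anything that is not valid UTF-8.
        return "latin-1"

def survey(path: str) -> dict:
    """Count how many lines of the dump fall into each category."""
    counts = {"ascii": 0, "utf-8": 0, "latin-1": 0}
    with open(path, "rb") as fh:
        for line in fh:
            counts[classify(line)] += 1
    return counts
```

A dump with non-zero counts in both the "utf-8" and "latin-1" buckets is exactly the mixed case described above.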
Hi, On Tue, Apr 22, 2008 at 6:32 AM, Ricardo Dias Marques <lists@ricmarques.net> wrote:
3 - Is the best migration strategy, the following sequence:
3.1. - "Transform" the mysqldump to a "pure" UTF-8 file
Doing a Latin-1 to UTF-8 conversion on the mysqldump directly will likely make any MARC records that are touched unparseable. I suggest, as part of your process, that you export the MARC bib and authority records separately, fix them using MARC::Record and the techniques you've already identified, then import them back into your 2.2.9 test database. Then you can fix a mysqldump of the non-MARC tables.

Regards,

Galen
--
Galen Charlton
Koha Application Developer
LibLime
galen.charlton@liblime.com
p: 1-888-564-2457 x709
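The record-by-record fix Galen outlines can be sketched roughly as follows. This is an illustrative Python sketch of the encoding test only; the real conversion should go through Perl's MARC::Record, precisely because re-encoding changes byte lengths, which invalidates the ISO 2709 leader and directory offsets that MARC::Record recomputes for you:

```python
# Illustrative per-record encoding fix (Python here for brevity; the
# actual job belongs to Perl's MARC::Record, which also rebuilds the
# ISO 2709 leader and directory offsets that re-encoding invalidates).

RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record marker

def to_utf8(raw: bytes) -> bytes:
    """Leave bytes that already decode as UTF-8 untouched; otherwise
    reinterpret them as Latin-1 and re-encode as UTF-8."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode("latin-1").encode("utf-8")

def fix_mrc(in_path: str, out_path: str) -> None:
    """Split a .mrc file on the record terminator and fix each record.
    Note: the rewritten records still need their leader/directory
    lengths recomputed by a proper MARC library before reimport."""
    with open(in_path, "rb") as fh:
        records = fh.read().split(RECORD_TERMINATOR)
    with open(out_path, "wb") as fh:
        for rec in records:
            if rec:
                fh.write(to_utf8(rec) + RECORD_TERMINATOR)
```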
Hi Galen, On Tue, Apr 22, 2008, Galen Charlton <galen.charlton@liblime.com> wrote:
Doing a Latin-1 to UTF-8 conversion on the mysqldump directly will likely make any MARC records that are touched unparseable. I suggest as part of your process that you export the MARC bib and authority records separately, fix them using MARC::Record and the techniques you've already identified, then import them back into your 2.2.9 test database. Then you can fix a mysqldump of the non-MARC tables.
First of all, thank you very much for that important tip!

Could you please point me to any web page that has a Perl code sample that does what you described, using the MARC::Record module - meaning Perl code that:

1 - Opens a .mrc file that has MARC bibliographic info (we use UNIMARC here) for several records

2 - For each record, sees if it's already in UTF-8:
a) If it is already in UTF-8, then skip it
b) If it is NOT in UTF-8 (namely because it is in the ISO-8859-1 / Latin-1 encoding / charset), then convert it to UTF-8

3 - Writes a .mrc file with the pure UTF-8 output.

I have already read the documentation for the MARC::Record module, located at:

http://search.cpan.org/~mikery/MARC-Record-2.0.0/lib/MARC/Record.pm

... but I must admit that I am still a bit confused. :-/

Thanks again!

Best wishes,
Ricardo Dias Marques
lists AT ricmarques DOT net
Hi, On Thu, Apr 24, 2008 at 2:24 PM, Ricardo Dias Marques <lists@ricmarques.net> wrote:
Could you please point me to any web page that has some Perl code sample that does what you described, using the MARC::Record module, meaning Perl code that:
1 - Opens a .mrc file that has MARC bibliographic info (we use UNIMARC here) for several records
2 - For each record, sees if it's already in UTF-8: a) If it is already in UTF-8, then skip it b) If it is NOT in UTF-8 (namely because it is in the ISO-8859-1 / Latin-1 encoding / charset), then convert it to UTF-8
3 - Writes a .mrc file with the pure UTF-8 output.
Very briefly: Koha 3's C4::Charset module's MarcToUTF8Record routine should give you some ideas. You can use that as the core of a routine to convert a file that contains mixed Latin-1 and UTF-8 records to UTF-8. However, it will not correctly handle a MARC record that has *both* Latin-1 and UTF-8, but could be modified to test each field and subfield to see if it contains UTF-8 or Latin-1.

Regards,

Galen
--
Galen Charlton
Koha Application Developer
LibLime
galen.charlton@liblime.com
p: 1-888-564-2457 x709
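The field-level test Galen suggests, for records that mix Latin-1 and UTF-8 *within* a single record, can be sketched like this. The flat (tag, value) pair representation is a simplification of what MARC::Record actually exposes, and the test is a heuristic: a pair of consecutive Latin-1 accented letters (e.g. "Ã©") is also valid UTF-8, so rare fields can be misclassified and may need manual review:

```python
# Field-level variant of the encoding fix: each field or subfield is
# tested and converted independently, so a record that mixes Latin-1
# and UTF-8 fields comes out uniformly UTF-8. The (tag, value) pairs
# are a simplified stand-in for MARC::Record's field objects.

def fix_field(value: bytes) -> bytes:
    """Convert one field/subfield value to UTF-8, leaving values that
    already decode as UTF-8 untouched. Heuristic: some short Latin-1
    sequences are coincidentally valid UTF-8 and will be left alone."""
    try:
        value.decode("utf-8")
        return value
    except UnicodeDecodeError:
        return value.decode("latin-1").encode("utf-8")

def fix_record_fields(fields):
    """Apply the per-field test to every (tag, value) pair of a
    record, returning a record whose fields are all UTF-8."""
    return [(tag, fix_field(value)) for tag, value in fields]
```

Running this over a record whose 200 field is already UTF-8 but whose 210 field is Latin-1 converts only the latter, which is exactly the case the whole-record approach gets wrong.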
Hi Galen, On Thu, Apr 24, 2008 at 8:45 PM, Galen Charlton <galen.charlton@liblime.com> wrote:
Very briefly, Koha 3's C4::Charset module's MarcToUTF8Record routine should give you some ideas. You can use that as the core of a routine to convert a file that contains mixed Latin-1 and UTF-8 records to UTF-8. However, it will not correctly handle a MARC record that has *both* Latin-1 and UTF-8, but could be modified to test each field and subfield to see if it contains UTF-8 or Latin-1.
Thanks Galen!

I have read the code of the MarcToUTF8Record routine, like you suggested, and it does seem to be a very good starting point.

If anyone else is curious about the MarcToUTF8Record routine, you may read its (current) source code in the Charset.pm file of Koha 3 (Beta 2), which is also available - in its "current" form - on the "git" web site - http://git.koha.org/ - specifically at:

http://git.koha.org/cgi-bin/gitweb.cgi?p=Koha;a=blob;f=C4/Charset.pm

I'll try to start experimenting with this when I get back to work on Monday (Friday, 25th of April is a National Holiday in Portugal).

Best wishes,
Ricardo Dias Marques
lists AT ricmarques DOT net