[Koha] how to fix encoding of biblioitems.kohaxml (moving to latest Koha)

Sun Feb 2 19:11:48 NZDT 2014

Hi,

I'm upgrading to the latest Koha release (3.14.01.000) from a quite old one 
(2.x).

I've imported the old database, let the web installation procedure perform 
the automatic steps to upgrade the SQL structure, then ran the 
convert_to_utf8.pl tool in .../migration_tools/22_to_30/

My data (items, patrons, several configuration parameters) seems to be 
there. However, koha_rebuild_zebra fails because of "wide characters" 
present in the database. The specific error is:

  Wide character in subroutine entry at /usr/share/perl5/MARC/Charset/Table.pm line 96

This leaves me with a (hopefully) consistent but unsearchable database.

A closer inspection of the biblioitems table reveals that several (140 out 
of some 2800) items contain "accented characters" in various fields. These 
may have crept in at various stages, e.g. when inserting new records using 
a Mac keyboard, or during not-so-careful upgrades.

Although I had set the default character set to UTF-8 everywhere (locale, 
Apache, MySQL and Koha itself), when I insert a new record containing 
accented letters (in the title field, say) from the web interface, the 
title looks fine on the record's detail page 
(cgi-bin/koha/catalogue/detail.pl?biblionumber=...), whereas the "MARC 
Preview" shows corrupted accented letters, as if they weren't encoded in 
UTF-8.
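
For illustration only: one common way accented letters end up looking corrupted is when UTF-8 bytes are read back as Latin-1. Whether this is what happens in the "MARC Preview" is an assumption, not something confirmed for this database. The sample title below is made up, and the sketch uses only Python's built-in string methods.

```python
# Hypothetical title containing one accented letter.
title = "caff\u00e8"  # "caffè", with a precomposed e-grave

# Correct UTF-8 bytes for the title.
utf8_bytes = title.encode("utf-8")

# What a reader sees if those bytes are wrongly interpreted as Latin-1:
# the two bytes of the accented letter become two separate characters.
misread = utf8_bytes.decode("latin-1")
print(misread)

# The damage is reversible as long as nothing else has touched the text.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == title)
```

If the corrupted letters in the preview do not show this two-character pattern, the cause is probably something else.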

An even closer inspection suggests that the encoding might indeed be UTF-8, 
but that some characters are in NFD rather than NFC normalization form 
(an accented character is stored as a base letter followed by a separate 
combining accent, instead of as a single precomposed character).
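
To make the NFD/NFC distinction concrete, here is a minimal Python sketch using the standard unicodedata module. It is not part of Koha; the sample string and the helper name is_nfc are made up for illustration.

```python
import unicodedata


def is_nfc(text):
    """Return True if text is already in NFC (precomposed) form."""
    return unicodedata.normalize("NFC", text) == text


# Hypothetical sample: e-grave written in NFD form, i.e. a plain "e"
# followed by U+0300 COMBINING GRAVE ACCENT.
decomposed = "e\u0300"

# NFC normalization merges the pair into the single code point U+00E8.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(is_nfc(decomposed), is_nfc(composed))

# Both forms are valid UTF-8, but the byte sequences differ.
print(decomposed.encode("utf-8"), composed.encode("utf-8"))
```

A check like is_nfc could, in principle, be run over text exported from the table to count how many records are affected. Whether NFC normalization alone would satisfy koha_rebuild_zebra is an open question here.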

Is there a way to fix my biblioitems table? Is there a way to have new 
records entered correctly, at least?

(I've even tried ALTER'ing the table to binary, then back to text with the 
correct encoding, field by field.)

I could in principle go and fix each of the 140 corrupted MARC records via 
the web interface, if that's the only way, but the point is that, 
seemingly, even newly entered accented characters produce garbled output 
in the marcxml field.

(The same may also apply to the marc field. That one is a binary SQL 
LONGBLOB and therefore possibly uneditable, but I assume it should be 
harmless to koha_rebuild_zebra.)

Many thanks for all your suggestions and help.

Regards,

Giuseppe.