character encoding & bulkmarcimport.pl
Hi,

I'm wondering if the script bulkmarcimport.pl is simply checking the value of leader byte 9 if your framework is set to MARC21/USMARC records and then trusting that value. If it's blank it assumes MARC-8 encoding and converts it to UTF-8 using MARC::Charset; if ldr9 is 'a' then it assumes UTF-8 and leaves it as is. Is there something else going on here?

Is this a relatively safe approach with MARC21/USMARC records as a whole? Should bulkmarcimport.pl only be used on records that are known to be only MARC-8 and/or UTF-8?

I'm wondering if there aren't other non-standard character encodings for MARC21 records out there. For instance, the Wellcome library [1] says it provides records in MARC21 and then says they are in the ISO-8859-1 (Latin1) character set. I can imagine there are others out there. I don't know if Latin1 would be a problem, but it seems that other character encodings might be if MARC-8 is assumed to be the character encoding when it isn't.

Does the built-in Z39.50 search do character set conversion to UTF-8 as well?

Thanks for any help you can provide understanding how Koha handles character encodings.

--Jason

[1] http://library.wellcome.ac.uk/node58.html#P24_1668
Hi Jason,

----- "Jason Ronallo" <jronallo@gmail.com> wrote:
> Hi, I'm wondering if the script bulkmarcimport.pl is simply checking for the value of leader byte 9 if your framework is set to MARC21/USMARC records and then trusting that value. If it's blank it assumes MARC-8 encoding and converts it to UTF-8 using MARC::Charset; if ldr9 is 'a' then it assumes UTF-8 and leaves it as is. Is there something else going on here?

One thing to realize is that the bulkmarcimport.pl utility is not meant to be used as-is; it is a rough example of how to import records into Koha. It's expected that your systems administrator has experience programming Perl and can customize it to your local needs.
It relies on the MARC::Record suite, which in turn relies on LEADER/09 to determine the encoding.
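
To make the LEADER/09 check concrete, here is a minimal sketch of the rule described above: a blank in leader position 09 means MARC-8, an 'a' means UTF-8. It is plain Python using only the standard library, not the code from bulkmarcimport.pl or MARC::Record; the function name `declared_encoding` and the sample leaders are made up for illustration.

```python
def declared_encoding(raw: bytes) -> str:
    """Return the encoding a MARC21 record claims in leader position 09.

    Hypothetical helper for illustration only; it reports what the
    leader says and does not inspect the field data at all.
    """
    if len(raw) < 24:
        raise ValueError("a MARC21 leader is 24 bytes long")
    flag = raw[9:10]  # leader position 09, counting from 0
    if flag == b" ":
        return "MARC-8"
    if flag == b"a":
        return "UTF-8"
    return "unknown"


# Two made-up leaders that differ only in position 09.
print(declared_encoding(b"00000nam a2200000 a 4500"))
print(declared_encoding(b"00000nam  2200000 a 4500"))
```

Note that this only reports what the record claims about itself. Whether the bytes in the fields actually match that claim is a separate question.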
> Is this a relatively safe approach with MARC21/USMARC records as a whole? Should bulkmarcimport.pl only be used on records that are known to be only MARC-8 and/or UTF-8?

It's a safe approach with properly encoded records, that is, records whose LEADER/09 matches the encoding they actually use.
> I'm wondering if there aren't other non-standard character encodings for MARC21 records out there. For instance, the Wellcome library [1] says it provides records in MARC21 and then says they are in ISO-8859-1 (Latin1) character set. I can imagine there are others out there. I don't know if Latin1 would be a problem, but it seems that other character encodings might be if MARC-8 is assumed to be the character encoding when it isn't.

You'll have to take it up with the Library of Congress. It was certainly short-sighted of them to design the standard with only two possible encodings ...
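
For records of uncertain origin, a rough sanity check before importing can catch some mislabelled records. The sketch below is standard-library Python, not part of Koha or MARC::Record, and the function name `encoding_mismatch` is hypothetical. It only compares LEADER/09 against whether the bytes after the leader decode as UTF-8.

```python
from typing import Optional


def encoding_mismatch(raw: bytes) -> Optional[str]:
    """Return a warning if leader/09 contradicts the record's bytes.

    Heuristic sketch only. Returns None when no contradiction is found,
    which does not prove the record is correctly labelled.
    """
    flag = raw[9:10]
    body = raw[24:]  # directory and fields, everything after the leader
    has_high_bytes = any(b >= 0x80 for b in body)
    try:
        body.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False

    if flag == b"a" and not valid_utf8:
        return "declared UTF-8 but bytes are not valid UTF-8"
    if flag == b" " and has_high_bytes and valid_utf8:
        return "declared MARC-8 but bytes look like UTF-8"
    return None
```

This cannot tell a Latin-1 record with a blank LEADER/09 apart from a genuine MARC-8 record, because both use bytes above 0x7F that are usually not valid UTF-8. For a source such as the Wellcome records, the encoding has to be known from the provider's documentation.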
> Does the built-in Z39.50 search do character set conversion to UTF-8 as well?

I don't believe it changes the encoding by default, but doing the conversion yourself is a trivial exercise in Perl.
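
As an illustration of what such a conversion involves, here is a sketch for a record that is known to be ISO-8859-1. It is written in standard-library Python rather than Perl, and it is not how Koha or MARC::Charset does it; the function name `latin1_record_to_utf8` is hypothetical. The point it shows is that re-encoding changes field lengths, so the directory and the leader have to be rebuilt, and LEADER/09 should be set to 'a' afterwards.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator


def latin1_record_to_utf8(raw: bytes) -> bytes:
    """Re-encode one ISO 2709 / MARC21 record from Latin-1 to UTF-8.

    Assumes the caller already knows the record is Latin-1; nothing
    here detects the encoding.
    """
    leader = raw[:24]
    base = int(leader[12:17])  # base address of data
    directory = raw[24:base - 1]  # without the directory's terminator

    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3]
        length = int(entry[3:7])  # includes the field terminator
        start = int(entry[7:12])  # relative to the base address
        data = raw[base + start:base + start + length - 1]
        fields.append((tag, data.decode("latin-1").encode("utf-8")))

    # Rebuild the directory with the new byte lengths and offsets.
    new_dir = b""
    new_data = b""
    for tag, data in fields:
        field = data + FT
        new_dir += tag + b"%04d%05d" % (len(field), len(new_data))
        new_data += field

    new_base = 24 + len(new_dir) + 1
    total = new_base + len(new_data) + 1
    new_leader = (
        (b"%05d" % total)
        + leader[5:9]
        + b"a"  # leader/09: record is now UTF-8
        + leader[10:12]
        + (b"%05d" % new_base)
        + leader[17:]
    )
    return new_leader + new_dir + FT + new_data + RT
```

The same treatment would apply to records fetched over Z39.50 before they are saved, provided the target's encoding is known.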
> Thanks for any help you can provide understanding how Koha handles character encodings.

Hope that helps.
Cheers,

--
Joshua Ferraro                       SUPPORT FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf@liblime.com |Full Demos at http://liblime.com/koha |1(888)KohaILS