Re: [Koha] losing data during import
Wednesday, August 4, 2004 23:28 CDT

Hi again, Scott,

I took a look at the records in detail. Sorry, they weren't missing the 2 blocks at the end of each, as they seemed to be in WordPad. MARCBreaker can't break them down properly, though, so despite first appearances, they still aren't valid MARC. Something is screwing up the Directory.

I think you must be right in that the non-ASCII characters definitely need to be replaced. Is there any other way you can replace the non-ASCII characters first? If you could send a sample of the same (or other) records in their original format off listserv, I can see if another method might work.

Cheers,
Steven F. Baljkas
library tech at large
Koha neophyte
Winnipeg, MB, Canada

P.S. You really shouldn't use the $g in 100 in the way that you did. That's not what it was intended for.
From: Scott Scriven <koha-main@toykeeper.net>
Date: 2004/08/04 Wed PM 08:41:18 CDT
To: koha@lists.katipo.co.nz
Subject: [Koha] losing data during import
Hello.
I'm having some difficulty keeping data intact when I import with the bulkmarcimport.pl script. Specifically, it seems that fields are getting the last 5 bytes chopped off. It seems to be related to character encodings, but I'm not really sure what to do about it. Converting from utf-8 to iso8859-1 seems to change the results, but not correct the problem. Manually replacing all non-ascii characters with safer equivalents seems to cure the problem, but is not feasible for the amount of data I have.
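The manual replacement described above could in principle be scripted. The following is only a minimal sketch in Python (standard library, not part of Koha or bulkmarcimport.pl), and `to_ascii` is a hypothetical helper name. It is lossy: accents are dropped, and any character with no ASCII base letter becomes `?`, so it is a workaround rather than a fix for the underlying encoding problem.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Replace accented characters with their unaccented ASCII base letters.

    Hypothetical helper for illustration only. Characters that have no
    ASCII decomposition are replaced with '?'.
    """
    # NFKD splits a precomposed letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII is replaced with '?'.
    return stripped.encode("ascii", errors="replace").decode("ascii")


if __name__ == "__main__":
    print(to_ascii("Dvo\u0159\u00e1k, Anton\u00edn"))
```

Running something like this over the source data before the MODS/MARC conversion would keep the rest of the pipeline unchanged, at the cost of losing the original spellings.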
I have a data sample which exhibits this problem; it is a collection of 15 Douglas Adams books:
http://toykeeper.net/tmp/koha/dna.mrc
It was generated from:
http://toykeeper.net/tmp/koha/dna.marcxml
http://toykeeper.net/tmp/koha/dna.mods
My conversion process goes from custom data to MODS, then MODS to MARC (xml) using the LoC stylesheets for doing so. It then converts to binary MARC using perl's MARC::Record and MARC::File::XML. Somewhere in the bulkmarcimport.pl script, data is getting lost. It's either MARC::Record failing to read its own files, or in Koha's code somewhere, but I don't know where.
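One plausible explanation, not confirmed anywhere in this thread, is a mismatch between character counts and byte counts. Each entry in a binary MARC record's directory stores the field length in bytes, so if a writer counts characters while the data is UTF-8, every multi-byte character makes the recorded length one or more bytes too short and a reader that trusts the directory cuts the field off early. The sketch below is Python (standard library only, not the MARC::Record code path); `directory_entry` and the sample field are made up for illustration.

```python
FIELD_TERMINATOR = "\x1e"
SUBFIELD_DELIMITER = "\x1f"


def directory_entry(tag: str, field: str, start: int, count_bytes: bool) -> str:
    """Build a 12-character directory entry: tag, 4-digit length, 5-digit offset.

    Hypothetical helper: count_bytes=False mimics a writer that measures
    the field in characters instead of UTF-8 bytes.
    """
    length = len(field.encode("utf-8")) if count_bytes else len(field)
    return f"{tag}{length:04d}{start:05d}"


# Illustrative 100 field: two blank indicators, subfield $a, field terminator.
# The name contains three characters that take two bytes each in UTF-8.
field = (
    "  " + SUBFIELD_DELIMITER + "a"
    + "Dvo\u0159\u00e1k, Anton\u00edn"
    + FIELD_TERMINATOR
)

correct = directory_entry("100", field, 0, count_bytes=True)
wrong = directory_entry("100", field, 0, count_bytes=False)

# Number of bytes a reader would lose if it trusted the character count.
shortfall = len(field.encode("utf-8")) - len(field)

if __name__ == "__main__":
    print("byte-based entry:     ", correct)
    print("character-based entry:", wrong)
    print("bytes lost on read:   ", shortfall)
```

If this is what is happening, the number of bytes lost per field should track the number of extra bytes used by the non-ASCII characters in that field, which is easy to check against a few of the affected records.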
Any hints? I'm hoping I can simply sidestep the conversion to/from binary marc, to avoid the problem; I'll let people know if this is effective.
-- Scott

_______________________________________________
Koha mailing list
Koha@lists.katipo.co.nz
http://lists.katipo.co.nz/mailman/listinfo/koha
* Baljkas Family <baljkas@mts.net> wrote:
> I took a look at the records in detail.
Thank you. I've been making very slow progress getting data translated.
> I think you must be right in that the non-ASCII characters definitely need to be replaced. Is there any other way you can replace the non-ASCII characters first?
I think I'll try modifying Koha's import script to accept MARC XML files. This way, I might simply sidestep the issue by removing two conversion steps.
> If you could send a sample of the same (or other) records in their original format off listserv, I can see if another method might work.
The majority of the data I've converted seems to import correctly. I don't know if it was valid MARC, but that has not prevented the rest from importing, which is what I'm concerned about.
> P.S. You really shouldn't use the $g in 100 in the way that you did. That's not what it was intended for.
I'm a bit new to this; is there a better place to put it? Ideally, I'd like to include as much information as possible, and keep the data normalized, but I get the impression that MARC and Koha weren't designed for that.

-- Scott
* Scott Scriven <koha-main@toykeeper.net> wrote:
> I think I'll try modifying Koha's import script to accept MARC XML files. This way, I might simply sidestep the issue by removing two conversion steps.
Just a note to everyone that this worked, and was a trivial modification. It can support the LoC MARC21slim XML format pretty easily. Here's the diff:

--- bulkmarcimport.pl.orig 2004-08-05 11:38:50.000000000 -0600
+++ bulkmarcimport.pl 2004-08-05 13:00:39.000000000 -0600
@@ -5,6 +5,7 @@
 # Koha modules used
 use MARC::File::USMARC;
+use MARC::File::XML;
 use MARC::Record;
 use MARC::Batch;
 use C4::Context;
@@ -69,7 +70,13 @@
 $char_encoding = 'MARC21' unless ($char_encoding);
 print "CHAR : $char_encoding\n" if $verbose;
 my $starttime = gettimeofday;
-my $batch = MARC::Batch->new( 'USMARC', $input_marc_file );
+my $batch = "";
+if($input_marc_file =~ /xml$/) {
+    $batch = MARC::Batch->new( 'XML', $input_marc_file );
+}
+else {
+    $batch = MARC::Batch->new( 'USMARC', $input_marc_file );
+}
 $batch->warnings_off();
 $batch->strict_off();
 my $i=0;

-- Scott
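For readers who have not seen the MARC21slim layout that the patched script now accepts, here is a rough sketch of its shape, read with Python's standard library rather than with MARC::File::XML. The sample record is made up for illustration and is not taken from dna.marcxml; `read_records` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by the LoC MARC21slim schema.
MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}

# Illustrative record only; not one of the records discussed in the thread.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <controlfield tag="001">example-1</controlfield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Adams, Douglas</subfield>
    </datafield>
    <datafield tag="245" ind1="1" ind2="4">
      <subfield code="a">The hitchhiker's guide to the galaxy</subfield>
    </datafield>
  </record>
</collection>
"""


def read_records(xml_text: str) -> list:
    """Return one list per record of (tag, [(subfield code, value), ...])."""
    root = ET.fromstring(xml_text)
    records = []
    # iter() also matches the root itself, so a bare <record> document works.
    for rec in root.iter(f"{{{MARC_URI}}}record"):
        fields = []
        for df in rec.findall("marc:datafield", MARC_NS):
            subfields = [
                (sf.get("code"), sf.text or "")
                for sf in df.findall("marc:subfield", MARC_NS)
            ]
            fields.append((df.get("tag"), subfields))
        records.append(fields)
    return records


if __name__ == "__main__":
    for fields in read_records(SAMPLE):
        for tag, subfields in fields:
            print(tag, subfields)
```

Because the XML form carries no directory or byte offsets, field boundaries come from the markup itself, which is presumably why importing it directly avoids the truncation seen with the binary files.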