[Koha] How to remove unwanted characters when importing MARC data? [SOLVED]
Michael Kuhn
mik at adminkuhn.ch
Fri Aug 4 03:31:38 NZST 2017
Hi
Just for the record. On 22 June 2017 I wrote:
> Yesterday I wrote:
>
>> Our library receives MARC data from EKZ (a German cataloging data
>> provider) which includes two unwanted characters:
>>
>> * a beginning "non-sorting character"
>> * an ending "non-sorting character"
>>
>> These characters can't be seen in the OPAC or in the hit list of the
>> staff client, but they do appear in the framework and also in the top
>> line of the web browser. Here is an example of a file containing such
>> characters: http://adminkuhn.ch/download/kuhn0000000
>>
>> When opening the original .mrc file with vi these characters show as:
>>
>> <98>The<9c> obsession
>>
>> With "od -c" they show as:
>>
>> 302 230 T h e 302 234 o b s e s s i o n
>>
>> Of course these characters could be removed, e.g. with sed (but this
>> results in a wrong record length in MARC LEADER positions 0-4), or
>> with software like MarcEdit. Either way it has to be done separately,
>> outside Koha and before the regular import process.
>>
>> Now the question is whether there is an EASY way to delete these
>> unwanted characters within Koha, for example by using the MARC
>> modification templates, which are used anyway when loading such data?
>
> About four or even five hours later, after trying different approaches,
> I have finally found the following solution for my case. Unfortunately
> there is no "easy" way - external software is needed:
>
> catmandu convert MARC to MARC --type XML < inputfile |
>   sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g' |
>   catmandu convert MARC --type XML to MARC > outputfile
>
> In fact I experimented with quite a few things - character
> representations included, of course - and with several tools, among
> them yaz-marcdump (which is part of the YAZ toolkit), xml2marc by Galen
> Charlton and even MarcEdit.
>
> One of the problems I had with MarcEdit is that I couldn't find a way to
> remove one single character throughout the record. So I finally settled
> on first transforming the original MARC file to MARCXML using
> yaz-marcdump, then removing the unwanted characters with sed and finally
> transforming MARCXML back to MARC using MarcEdit. Since I'm not very GUI
> friendly I then looked for a tool to do the same on the shell.
> Unfortunately Galen Charlton's slim "xml2marc" from 2011 seems to have a
> problem with character sets, thus I went for the fatter catmandu (
> http://librecat.org/Catmandu/ ) which eventually did the trick.
>
> What I learned is that even a (seemingly) minor change in a MARC record
> can be real hell. Of course now that I have the solution, it looks easy.
> However, I was also quite surprised that it is not possible to directly
> load MARCXML via the Koha menu "Tools > Stage MARC records for import".
> And I was mildly disappointed when Koha only told me "1 records not
> staged because of MARC error" but gave me no hint what the error really
> was.
>
> By the way: after deleting the unwanted characters with sed, the record
> length of course isn't correct anymore. You may replace the incorrect
> LEADER positions 0-4 with 00000 or just transform MARCXML to MARC -
> MarcEdit and catmandu both created correct new LEADER positions 0-4
> automatically.
>
> Thanks again to everybody who helped by giving hints and ideas!
The following command, which I mentioned there, does NOT convert the
first record of the original MARC file:
catmandu convert MARC to MARC --type XML < inputfile |
  sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g' |
  catmandu convert MARC --type XML to MARC > outputfile
I don't know what the problem is (and at the moment I really don't care).
However, the following command produces an output file that also
contains the very first record:
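A dropped record like this is easy to miss, so it may help to compare the number of records in the input and output files. A minimal sketch, assuming both files are binary MARC (ISO 2709), where every record ends with the record terminator byte 0x1D; the helper name count_marc_records is made up for illustration:

```shell
# Hypothetical helper: count the records in a binary MARC (ISO 2709) file.
# Every record ends with the record terminator byte 0x1D (octal 035),
# so the number of those bytes equals the number of records.
count_marc_records() {
    # Delete every byte except 0x1D, then count what is left.
    # The final tr strips the padding some wc implementations add.
    LC_ALL=C tr -cd '\035' < "$1" | wc -c | tr -d ' '
}

# Usage, with inputfile and outputfile as in the commands in this mail:
#   count_marc_records inputfile
#   count_marc_records outputfile
```

If the two numbers differ, the conversion lost (or gained) records somewhere along the pipeline.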
yaz-marcdump -t utf-8 -o marcxml -l 9=97 inputfile |
  sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g' |
  catmandu convert MARC --type XML to MARC > outputfile
Just in case someone else ever uses this command.
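The sed step can also be tried on its own before touching real data. A minimal sketch, assuming GNU sed (the \xHH escapes are a GNU extension) and using a made-up helper name strip_nsc; the sample bytes are the ones from the "od -c" output quoted above:

```shell
# Hypothetical helper: remove the UTF-8 encoded non-sorting markers
# (bytes C2 98 and C2 9C) from whatever arrives on stdin.
# Assumes GNU sed, which understands \xHH escapes in the pattern.
strip_nsc() {
    LC_ALL=C sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g'
}

# Sample title rebuilt from the octal byte values shown by "od -c",
# piped through the helper and dumped again to inspect the result.
printf '\302\230The\302\234 obsession\n' | strip_nsc | od -c
```

LC_ALL=C is an extra precaution in this sketch so that sed matches raw bytes regardless of the locale; it is not part of the commands above.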
Best wishes: Michael
--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik at adminkuhn.ch · W www.adminkuhn.ch