[Koha] How to remove unwanted characters when importing MARC data? [SOLVED]

Michael Kuhn mik at adminkuhn.ch
Fri Aug 4 03:31:38 NZST 2017


Just for the record. On 22 June 2017 I wrote:

> Yesterday I wrote:
>> Our library receives MARC data from EKZ (a German cataloging data 
>> provider) which includes two unwanted characters:
>> * a beginning "non-sorting character"
>> * an ending "non-sorting character"
>> These characters can't be seen in the OPAC and in the hitlist of the 
>> staff client, but they do appear in the framework and also in the top 
>> line of the webbrowser. Here is an example of a file containing such 
>> characters: http://adminkuhn.ch/download/kuhn0000000
>> When opening the original .mrc file with vi these characters show as:
>> <98>The<9c> obsession
>> With "od -c" they show as:
>> 302 230   T   h   e 302 234       o   b   s   e   s   s   i   o   n
>> Of course these characters could be removed with sed, for example, but 
>> that leaves a wrong record length in MARC LEADER positions 0-4, and it 
>> has to be done separately on the shell, outside and before the regular 
>> import process. Software like MarcEdit would be another option.
>> Now the question is whether there is an EASY way to delete these 
>> unwanted characters within Koha, for example by using the MARC 
>> modification templates, which are used anyway when loading such data?
> About four or even five hours later, after trying different ways I have 
> finally found the following solution for my case. Unfortunately there is 
> no "easy" way - external software is needed:
> catmandu convert MARC to MARC --type XML < inputfile \
>   | sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g' \
>   | catmandu convert MARC --type XML to MARC > outputfile
> In fact I was playing around with quite a few things - including character 
> representations of course - among them yaz-marcdump (which is part of 
> the YAZ toolkit), xml2marc by Galen Charlton and even MarcEdit.
> One of the problems I had with MarcEdit is that I couldn't find a way to 
> remove one single character throughout the record. So I finally settled on 
> first transforming the original MARC file to MARCXML using yaz-marcdump, 
> then removing the unwanted characters with sed and finally transforming 
> MARCXML back to MARC using MarcEdit. Since I'm not very GUI friendly I 
> then looked for a tool to do the same on the shell. Unfortunately Galen 
> Charlton's slim "xml2marc" from 2011 seems to have a problem with 
> character sets, thus I went for the fatter catmandu ( 
> http://librecat.org/Catmandu/ ) which eventually did the trick.
> What I learned is that even a (seemingly) minor change in a MARC record 
> can be real hell. Of course now that I have the solution, it looks 
> easy. However, I was also quite surprised that it is not possible to 
> directly load MARCXML via the Koha menu "Tools > Stage MARC records for 
> import". And I was mildly disappointed when Koha only told me "1 
> records not staged because of MARC error" but gave me no hint what the 
> error really was.
> By the way: after deleting the unwanted characters with sed, the record 
> length is of course no longer correct. You may replace the incorrect 
> LEADER positions 0-4 with 00000 or just transform MARCXML to MARC - 
> MarcEdit and catmandu both created correct new LEADER positions 0-4 
> automatically.
> Thanks again to everybody who helped by giving hints and ideas!
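
A possible sanity check for the record length mentioned in the quoted message: 
in a binary MARC file, LEADER positions 0-4 hold the record length in bytes, 
including the record terminator (byte 0x1D). The sketch below compares the 
declared length with the actual length of each record. The function name 
leader_lengths is made up for this example, and it assumes an awk that counts 
bytes under LC_ALL=C and records without NUL bytes. It only reports 
mismatches, it does not repair anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of Koha, catmandu or YAZ): read binary MARC
# on standard input and print "declared actual" record lengths, one line
# per record. Records end with the record terminator 0x1D (octal 035).
leader_lengths() {
    LC_ALL=C awk '
        BEGIN { RS = "\035" }           # split the input at record terminators
        length($0) >= 24 {              # skip anything shorter than a leader
            # awk strips the terminator, so add 1 to get the full length
            printf "%s %05d\n", substr($0, 1, 5), length($0) + 1
        }'
}

# Toy record, NOT real MARC data: the leader claims 30 bytes, but only
# 28 are present (as if two bytes had been deleted without fixing the leader).
printf '00030%019d123\035' 0 | leader_lengths
```

For a real file this would be called as "leader_lengths < outputfile". After 
the round trip through MARCXML the two columns should agree for every record.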

The following command I mentioned does NOT convert the first record of 
the original MARC file!

catmandu convert MARC to MARC --type XML < inputfile \
  | sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g' \
  | catmandu convert MARC --type XML to MARC > outputfile

I don't know what the problem is (and at the moment I really don't care). 
However, the following command produces an output file that also 
contains the very first record:

yaz-marcdump -t utf-8 -o marcxml -l 9=97 inputfile \
  | sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g' \
  | catmandu convert MARC --type XML to MARC > outputfile

Just in case someone else ever uses this command.
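
To try the sed step on its own, without any MARC file, something like the 
following sketch might help. The byte pairs C2 98 and C2 9C (shown by "od -c" 
as 302 230 and 302 234) are the UTF-8 encodings of U+0098 and U+009C. The 
\xHH escapes used in the commands above are a GNU sed extension, so this 
variant builds the bytes with printf instead, which should also work with 
other sed implementations. The function name strip_nonsort is made up for 
this example.

```shell
#!/bin/sh
# Hypothetical helper: remove the UTF-8 encoded non-sorting markers
# U+0098 (bytes C2 98) and U+009C (bytes C2 9C) from standard input.
strip_nonsort() {
    nsb=$(printf '\302\230')    # beginning non-sorting character
    nse=$(printf '\302\234')    # ending non-sorting character
    # LC_ALL=C makes sed treat the input as plain bytes
    LC_ALL=C sed -e "s/$nsb//g" -e "s/$nse//g"
}

# Sample title with the same bytes as in the "od -c" output quoted above
printf '\302\230The\302\234 obsession\n' | strip_nonsort
```

Only the exact two-byte sequences are removed, so other characters whose 
UTF-8 encoding starts with the byte C2 stay intact.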

Best wishes: Michael
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik at adminkuhn.ch · W www.adminkuhn.ch
