[Koha] How to remove unwanted characters when importing MARC data? [SOLVED]

Thu Jun 22 13:33:46 NZST 2017

Hi

Yesterday I wrote:

> Our library receives MARC data from EKZ (a German cataloging data 
> provider) which includes two unwanted characters:
> 
> * a beginning "non-sorting character"
> * an ending "non-sorting character"
> 
> These characters can't be seen in the OPAC and in the hitlist of the 
> staff client, but they do appear in the framework and also in the top 
> line of the webbrowser. Here is an example of a file containing such 
> characters: http://adminkuhn.ch/download/kuhn0000000
> 
> When opening the original .mrc file with vi these characters show as:
> 
> <98>The<9c> obsession
> 
> With "od -c" they show as:
> 
> 302 230   T   h   e 302 234       o   b   s   e   s   s   i   o   n
> 
> Of course these characters could be removed e. g. with sed (but this 
> will result in a wrong character length in MARC LEADER positions 0-4) 
> and also it has to be done separately on the shell outside and before 
> the regular importing process. Or even using software like MarcEdit.
> 
> Now the question is if there is an EASY way how to delete these unwanted 
> characters within Koha, for example by using the MARC modification 
> templates which is used anyway when loading such data?

After four or five hours of trying different approaches, I have finally 
found the following solution for my case. Unfortunately there is no 
"easy" way within Koha - external software is needed:

catmandu convert MARC to MARC --type XML < inputfile \
  | sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g' \
  | catmandu convert MARC --type XML to MARC > outputfile
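
For anyone who would rather do the sed step in a script, here is a 
minimal Python sketch of the same character removal on the intermediate 
MARCXML. It assumes the MARCXML is UTF-8 encoded, so that the bytes 
C2 98 and C2 9C decode to the code points U+0098 and U+009C. The function 
names and the file paths passed to them are hypothetical, not part of 
catmandu or Koha.

```python
# Code points behind the UTF-8 byte pairs C2 98 and C2 9C seen in "od -c".
# Mapping a code point to None makes str.translate() delete it.
UNWANTED = {0x98: None, 0x9C: None}


def strip_nonsorting(text):
    """Return text with U+0098 and U+009C removed."""
    return text.translate(UNWANTED)


def clean_marcxml_file(src, dst):
    """Copy a UTF-8 MARCXML file from src to dst, dropping both characters.

    Assumption: the file is UTF-8; newline="" keeps line endings untouched.
    """
    with open(src, encoding="utf-8", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(strip_nonsorting(line))
```

This only replaces the middle part of the pipeline; the conversion from 
MARC to MARCXML and back is still left to catmandu or a similar tool.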

In fact I experimented with quite a few tools - and of course with 
character representations - among them yaz-marcdump (which is part of 
the YAZ toolkit), xml2marc by Galen Charlton and even MarcEdit.

One of the problems I had with MarcEdit is that I couldn't find a way to 
remove a single character throughout the whole record. So I first 
settled on transforming the original MARC file to MARCXML using 
yaz-marcdump, then removing the unwanted characters with sed, and 
finally transforming the MARCXML back to MARC using MarcEdit. Since I'm 
not much of a GUI person, I then looked for a tool to do the same on the 
shell. Unfortunately Galen Charlton's slim "xml2marc" from 2011 seems to 
have a problem with character sets, so I went for the heavier catmandu 
( http://librecat.org/Catmandu/ ), which eventually did the trick.

What I learned is that even a (seemingly) minor change in a MARC record 
can be a real ordeal. Of course now that I have the solution, it looks 
easy. However, I was also quite surprised that it is not possible to 
load MARCXML directly via the Koha menu "Tools > Stage MARC records for 
import". And I was a little disappointed that Koha only told me "1 
records not staged because of MARC error" but gave me no hint as to what 
the error really was.

By the way: after deleting the unwanted characters with sed, the record 
length is of course no longer correct. You may replace the incorrect 
LEADER positions 0-4 with 00000 or just transform MARCXML to MARC - 
MarcEdit and catmandu both created correct new LEADER positions 0-4 
automatically.
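
To illustrate why the length goes wrong, here is a rough Python sketch 
that strips the two byte sequences from a single raw MARC (ISO 2709) 
record and rebuilds it. Note that besides LEADER positions 0-4, the 
field lengths and offsets in the directory and the base address in 
LEADER positions 12-16 depend on the field contents too, so the sketch 
recomputes all of them. It assumes one UTF-8 encoded record per call; the 
function names are hypothetical and this is not how catmandu or MarcEdit 
do it internally.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator
UNWANTED = (b"\xc2\x98", b"\xc2\x9c")  # UTF-8 bytes of U+0098 and U+009C


def parse_record(record):
    """Split one raw record into its 24-byte leader and (tag, data) pairs."""
    leader = record[:24]
    base = int(record[12:17])          # base address of data
    directory = record[24:base - 1]    # without its trailing field terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3]
        length = int(entry[3:7])       # includes the field terminator
        start = int(entry[7:12])       # offset relative to the base address
        data = record[base + start:base + start + length - 1]
        fields.append((tag, data))
    return leader, fields


def build_record(leader, fields):
    """Assemble a record, recomputing directory, base address and length."""
    directory = b""
    body = b""
    for tag, data in fields:
        field = data + FT
        directory += tag + b"%04d%05d" % (len(field), len(body))
        body += field
    base = 24 + len(directory) + 1     # leader + directory + terminator
    total = base + len(body) + 1       # plus record terminator
    new_leader = (b"%05d" % total) + leader[5:12] + (b"%05d" % base) + leader[17:24]
    return new_leader + directory + FT + body + RT


def strip_unwanted(record):
    """Remove the unwanted byte sequences and return a consistent record."""
    leader, fields = parse_record(record)
    cleaned = []
    for tag, data in fields:
        for seq in UNWANTED:
            data = data.replace(seq, b"")
        cleaned.append((tag, data))
    return build_record(leader, cleaned)
```

A file with several records would first have to be split on the record 
terminator (hex 1D) before calling strip_unwanted() on each record.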

Thanks again to everybody who helped with hints and ideas!

Best wishes: Michael
-- 
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik at adminkuhn.ch · W www.adminkuhn.ch