[Koha] How to remove unwanted characters when importing MARC data?

Thu Jun 22 05:21:12 NZST 2017

Hello Michael,

I don't think you'll find an easy way to do that within Koha. Bulk edit
might work, but I haven't used it, so I can't say.

From our experience (we've had all sorts of unwanted data like what
you're seeing, and worse), MARCXML is the way to go.
Assuming the MARC file is well-formed, convert MARC -> MARCXML (see MARC4J
<https://github.com/marc4j/marc4j>, there are others) and then apply a
custom-made XSL stylesheet (take your pick of tools: xmlstarlet
<http://xmlstar.sourceforge.net/docs.php>, xmllint
<http://xmlsoft.org/xmllint.html>, xsltproc
<http://xmlsoft.org/XSLT/xsltproc.html>, whatever). Switching fields,
removing unwanted characters, joining fields, splitting fields: it can
all be done with XSL. MarcEdit didn't cover all our needs.
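
To give an idea of the cleanup step, here is a minimal sketch. It is
written in Python with the standard library instead of XSL, so it is a
swapped-in technique and not the stylesheet we actually use. It assumes
the record has already been converted to MARCXML and that the unwanted
characters are U+0098 and U+009C (the UTF-8 byte pairs shown as
302 230 and 302 234 by "od -c"). The function name strip_non_sorting is
made up for illustration.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

# Assumption: the two unwanted characters are U+0098 and U+009C.
# Mapping a code point to None makes str.translate() delete it.
NON_SORTING = {0x98: None, 0x9C: None}


def strip_non_sorting(marcxml: str) -> str:
    """Return the MARCXML with U+0098/U+009C removed from all subfields."""
    # Keep the MARCXML namespace as the default one when serializing.
    ET.register_namespace("", MARC_NS)
    root = ET.fromstring(marcxml)
    for subfield in root.iter(f"{{{MARC_NS}}}subfield"):
        if subfield.text:
            subfield.text = subfield.text.translate(NON_SORTING)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<record xmlns="http://www.loc.gov/MARC21/slim">'
        '<datafield tag="245" ind1="1" ind2="0">'
        '<subfield code="a">\u0098The\u009c obsession</subfield>'
        "</datafield></record>"
    )
    print(strip_non_sorting(sample))
```

Because the edit happens on MARCXML and not on the raw .mrc bytes, the
record length in the leader is not something you have to patch by hand;
it only matters again if you convert back to binary MARC, and the
converter should recompute it then.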

Yes, you'll have to learn XSL if you don't know it already, and yes, it
will take time to figure it all out. But if you're working with Koha for
the long run, you'll be equipped with a tool that can solve virtually all
your data problems in the future.

Have a good one,

Pedro Amorim

2017-06-21 16:55 GMT+00:00 Michael Kuhn <mik at adminkuhn.ch>:

> Hi
>
> Our library receives MARC data from EKZ (a German cataloging data
> provider) which includes two unwanted characters:
>
> * a beginning "non-sorting character"
> * an ending "non-sorting character"
>
> These characters can't be seen in the OPAC or in the hitlist of the staff
> client, but they do appear in the framework and also in the top line of the
> web browser. Here is an example of a file containing such characters:
> http://adminkuhn.ch/download/kuhn0000000
>
> When opening the original .mrc file with vi these characters show as:
>
> <98>The<9c> obsession
>
> With "od -c" they show as:
>
> 302 230   T   h   e 302 234       o   b   s   e   s   s   i   o   n
>
> Of course these characters could be removed, e.g. with sed, but this will
> result in a wrong record length in MARC LEADER positions 0-4, and it also
> has to be done separately on the shell, outside and before the regular
> importing process. Software like MarcEdit would be another option.
>
> Now the question is whether there is an EASY way to delete these unwanted
> characters within Koha, for example by using the MARC modification
> templates, which are used anyway when loading such data?
>
> Best wishes: Michael
> --
> Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
> Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
> T 0041 (0)61 261 55 61 · E mik at adminkuhn.ch · W www.adminkuhn.ch