How to remove unwanted characters when importing MARC data?
Hi

Our library receives MARC data from EKZ (a German cataloging data provider) which includes two unwanted characters:

* a beginning "non-sorting character"
* an ending "non-sorting character"

These characters can't be seen in the OPAC or in the hit list of the staff client, but they do appear in the framework and also in the top line of the web browser. Here is an example of a file containing such characters: http://adminkuhn.ch/download/kuhn0000000

When opening the original .mrc file with vi these characters show as:

<98>The<9c> obsession

With "od -c" they show as:

302 230 T h e 302 234 o b s e s s i o n

Of course these characters could be removed, e.g. with sed (but this results in a wrong record length in MARC LEADER positions 0-4), and it also has to be done separately on the shell, outside and before the regular import process. Or even with software like MarcEdit.

Now the question is whether there is an EASY way to delete these unwanted characters within Koha, for example by using the MARC modification templates, which are used anyway when loading such data?

Best wishes: Michael
--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch
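For readers who want to reproduce the byte sequences quoted in this message: octal 302 230 and 302 234 are the bytes C2 98 and C2 9C, which are the UTF-8 encodings of the control characters U+0098 and U+009C. The following is only a minimal sketch. It assumes a POSIX shell with GNU sed (the \xHH escapes are a GNU extension), and the file names sample.txt and cleaned.txt are made up for illustration. It works on a plain text fragment, not on a real MARC record.

```shell
# Recreate the title bytes from the post (sample.txt is a made-up name).
# \302\230 and \302\234 are octal for the bytes C2 98 and C2 9C.
printf '\302\230The\302\234 obsession' > sample.txt

# Inspect the raw bytes, as done in the post.
od -c sample.txt

# Strip both two-byte sequences. LC_ALL=C makes sed treat the input as
# plain bytes; the \xHH escapes assume GNU sed.
LC_ALL=C sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g' sample.txt > cleaned.txt

# Compare the sizes: the cleaned file is shorter, which is why a length
# stored inside a binary MARC record no longer matches after such an edit.
before=$(wc -c < sample.txt | tr -d ' ')
after=$(wc -c < cleaned.txt | tr -d ' ')
echo "before=$before after=$after"
```

The size difference is the point of the sketch: any byte-level edit of a binary MARC file changes lengths that the record itself stores.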
Hello Michael,

I don't think you'll find an easy way to do that within Koha. Maybe bulk edit could do it, but I don't know, I haven't used it.

From our experience (we've had all sorts of unwanted data like what you're experiencing, and worse), MARCXML is the way to go. Assuming the MARC file is well-formed, convert MARC -> MARCXML (see MARC4J <https://github.com/marc4j/marc4j>, there are others) and afterwards apply a custom-made XSL stylesheet (take your pick: xmlstarlet <http://xmlstar.sourceforge.net/docs.php>, xmllint <http://xmlsoft.org/xmllint.html>, xsltproc <http://xmlsoft.org/XSLT/xsltproc.html>, whatever). Switching fields, removing unwanted characters, joining fields, splitting fields, whatever: it can be done with XSL. MarcEdit wouldn't respond to all our needs.
Yes, you'll have to learn XSL if you don't know already and yes it will require time to figure it all out but if you're working with Koha for the long run, you'll virtually be equipped with a tool that'll solve all your data problems in the future.

Have a good one,
Pedro Amorim
Hi Pedro

> I don't think you'll find an easy way within Koha to do that, maybe bulk edit but I don't know, haven't used it.
>
> From our experience - we've had all sorts of unwanted data like the one you're experiencing and worse - MARCXML is the way to go. Assuming the MARC file is well-formed, convert MARC -> MARCXML (see MARC4J <https://github.com/marc4j/marc4j>, there are others) and apply a custom made XSL (take your pick: xmlstarlet <http://xmlstar.sourceforge.net/docs.php>, xmllint <http://xmlsoft.org/xmllint.html>, xsltproc <http://xmlsoft.org/XSLT/xsltproc.html>, whatever) after.

Which of these would you actually recommend? (Which is the "best" one?)

> Switch fields, remove unwanted characters, field joining, field splitting, whatever, it can be done with XSL. MarcEdit wouldn't respond to all our needs.
>
> Yes, you'll have to learn XSL if you don't know already and yes it will require time to figure it all out but if you're working with Koha for the long run, you'll virtually be equipped with a tool that'll solve all your data problems in the future.

Until now I have just written my own MARCXML when migrating data, but there was no need to change MARCXML. I already know a bit of XSL, but this means someone will have to download the data, then edit it in a way that still has to be worked out, and only then can it be imported into Koha. Since the library acquires such data several times a year, the editing process has to be as easy as possible, because it will be done by an unsuspecting librarian with no shell experience or even shell access.

Thanks for the hints on XSL - I will put that on my personal todo list, but this list gets longer and longer, being filled up with acronyms and no end...

Best wishes: Michael
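To give an idea of what the XSL route discussed here might look like, the following is a rough sketch rather than a tested production stylesheet. It assumes the data has already been converted to MARCXML, that the unwanted characters are U+0098 and U+009C, and that xsltproc is installed. The file name strip-nsb.xsl and the function name strip_nsb_xml are made up for illustration.

```shell
# Write a small XSLT 1.0 stylesheet (strip-nsb.xsl is a made-up file name).
# It copies an XML document unchanged except that U+0098 and U+009C
# are deleted from every text node.
cat > strip-nsb.xsl <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy every node and attribute as it is. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Text nodes: translate() with an empty third argument deletes
       the characters listed in the second argument. -->
  <xsl:template match="text()" priority="1">
    <xsl:value-of select="translate(., '&#x98;&#x9C;', '')"/>
  </xsl:template>
</xsl:stylesheet>
EOF

# Hypothetical helper: apply the stylesheet to one MARCXML file and
# print the cleaned XML on standard output.
strip_nsb_xml() {
    xsltproc strip-nsb.xsl "$1"
}
```

It could then be called as "strip_nsb_xml records.xml > cleaned.xml", where records.xml stands for the MARCXML export. The result still has to be converted back to binary MARC before staging it in Koha, so this does not remove the need for a shell step.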
Take a look at C4::Charset::nsb_clean. I guess you can add more substitutions there.
Hi Jonathan

> Take a look at C4::Charset::nsb_clean I guess you can add more substitutions there.

Trouble is I don't know anything about "C4::Charset::nsb_clean". I understand it is part of a Perl module, but I don't know how to use Perl (except for some small stuff like changing existing scripts). So this would be no easy way for me, and especially no easy way for the librarian who has to deal with that data...

Best wishes: Michael
Hi Michael

Did you try MarcEdit? http://marcedit.reeset.net/

Kind regards
Marc Véron
Hi Marc

> Did you try marcedit? http://marcedit.reeset.net/

Not yet. As I wrote, it should be an easy way (one that can be handled by an unsuspecting librarian several times a year). We hoped there was an easier way than installing even more software and learning how to use it. After all it's just deleting two characters. But well, I know... "just" and "only" are not valid when dealing with software... sigh.

As I wrote to Jane Cothron, who also suggested MarcEdit: "It's good to know MarcEdit could do it, so I probably will try to persuade the responsible librarian to install and learn MarcEdit (unfortunately this is not a one-off task but the library acquires such data several times a year)."

Thanks for the suggestion!

Best wishes: Michael
Salvete!

> Not yet. As I wrote it should be an easy way (which can be handled by an unsuspecting librarian several times a year). We hoped there is an easier way than install even more software and learn how to use it. After all it's just deleting two characters. But well, I know... "just" and "only" are not valid when dealing with software... sigh.
>
> As I wrote to Jane Cothron who also suggested MarcEdit: "It's good to know MarcEdit could do it, so I probably will try to persuade the responsible librarian to install and learn MarcEdit (unfortunately this is not a one-off task but the library acquires such data several times a year)."

MarcEdit is a fundamental tool for cataloguing, arguably a must have for almost any ILS. If you're nervous at all, it's free in terms of price. It's also so easy, an ijiot like meself has very little trouble using it. When you get good with it, it can be positively magical.

Cheers,
Brooke
Hi Brooke

> MarcEdit is a fundamental tool for cataloguing, arguably a must have for almost any ILS. If you're nervous at all, it's free in terms of price. It's also so easy, an ijiot like meself has very little trouble using it. When you get good with it, it can be positively magical.

As a matter of fact (instead of trying to solve problems with the current Kohadevbox, another task) I was downloading MarcEdit just now. I tried it maybe five years ago and had no use for it then (and as a Linuxite I didn't like the Windows/Wine dependency). But everything has changed now and I'll give it another try... Back then it was just one of many tools, now I have a concrete problem to solve...

Best wishes: Michael
Hi

Yesterday I wrote:

> Now the question is if there is an EASY way how to delete these unwanted characters within Koha, for example by using the MARC modification templates which is used anyway when loading such data?
About four or even five hours later, after trying different approaches, I have finally found the following solution for my case. Unfortunately there is no "easy" way - external software is needed:

catmandu convert MARC to MARC --type XML < inputfile | sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g' | catmandu convert MARC --type XML to MARC > outputfile

In fact I was playing around with quite a lot of stuff - including character representations of course - among them yaz-marcdump (which is part of the YAZ toolkit), xml2marc by Galen Charlton and even MarcEdit.

One of the problems I had with MarcEdit is that I couldn't find a way to remove one single character throughout the record. So I finally settled on first transforming the original MARC file to MARCXML using yaz-marcdump, then removing the unwanted characters with sed and finally transforming MARCXML back to MARC using MarcEdit. Since I'm not very fond of GUIs I then looked for a tool to do the same on the shell. Unfortunately Galen Charlton's slim "xml2marc" from 2011 seems to have a problem with character sets, thus I went for the fatter catmandu ( http://librecat.org/Catmandu/ ) which eventually did the trick.

What I learned is that even a (seemingly) minor change in a MARC record can be some kind of real hell. Of course now that I have the solution, it looks easy. However, I was also quite surprised that it is not possible to load MARCXML directly via the Koha menu "Tools > Stage MARC records for import". And I was mildly disappointed when Koha only told me "1 records not staged because of MARC error" but gave me no hint what the error really was.

By the way: after deleting the unwanted characters with sed, the record length of course isn't correct anymore. You may replace the incorrect LEADER positions 0-4 with 00000 or just transform MARCXML to MARC - MarcEdit and catmandu both created correct new LEADER positions 0-4 automatically.

Thanks again to everybody who helped by giving hints and ideas!

Best wishes: Michael
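As a small illustration of the record length issue mentioned in this message: LEADER positions 0-4 hold the total record length in bytes as five zero-padded decimal digits. The sketch below checks that value for a file containing exactly one binary MARC record. The function name check_leader_length and the file name one.mrc are made up, and the sample record is a synthetic, empty record built only for the demonstration. Note also that in binary MARC the directory stores each field's length and starting position, so correcting positions 0-4 alone does not repair those; strict parsers may still object, which is one more reason to round-trip through MARCXML as described above.

```shell
# Hypothetical helper: for a file holding exactly ONE binary MARC record,
# compare the length stated in LEADER positions 0-4 with the real size.
check_leader_length() {
    stated=$(head -c 5 "$1")
    actual=$(printf '%05d' "$(wc -c < "$1" | tr -d ' ')")
    if [ "$stated" = "$actual" ]; then
        echo "ok $stated"
    else
        echo "mismatch leader=$stated actual=$actual"
    fi
}

# Synthetic minimal record: 24-byte leader, empty directory closed by
# a field terminator (octal 036), then the record terminator (octal 035).
printf '00026nam a2200025   4500\036\035' > one.mrc
check_leader_length one.mrc
```

For a real file with several records the same comparison would have to be made per record, which is easier to leave to a MARC-aware tool.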
Hi

Just for the record. On 22 June 2017 I wrote:

> About four or even five hours later, after trying different ways I have finally found the following solution for my case. Unfortunately there is no "easy" way - external software is needed:
> [...]
The following command I mentioned does NOT convert the first record of the original MARC file!

catmandu convert MARC to MARC --type XML < inputfile | sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g' | catmandu convert MARC --type XML to MARC > outputfile

I don't know what the problem is (and at the moment I really don't care). However, the following command will result in an output file that also contains the very first record:

yaz-marcdump -t utf-8 -o marcxml -l 9=97 inputfile | sed -e 's/\xc2\x98//g' -e 's/\xc2\x9c//g' | catmandu convert MARC --type XML to MARC > outputfile

Just in case someone else ever uses this command.

Best wishes: Michael
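Since one of the pipelines in this thread silently dropped a record, a quick sanity check after any such conversion is to compare record counts. The sketch below is one possible way to do that with standard tools only. It relies on the fact that every record in a binary MARC file ends with the record terminator byte 0x1D (octal 035). The function names count_marc_records and same_record_count are made up for illustration.

```shell
# Hypothetical helper: count records in a binary MARC file by counting
# record terminator bytes (0x1D, octal 035).
count_marc_records() {
    LC_ALL=C tr -cd '\035' < "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: print both counts and succeed only if they match.
same_record_count() {
    in_count=$(count_marc_records "$1")
    out_count=$(count_marc_records "$2")
    echo "input=$in_count output=$out_count"
    [ "$in_count" -eq "$out_count" ]
}
```

Run after the conversion, for example as: same_record_count inputfile outputfile || echo "record count differs". This only detects missing or extra records; it says nothing about whether the content of each record survived intact.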
participants (5)
- BWS Johnson
- Jonathan Druart
- Marc Véron
- Michael Kuhn
- Pedro Amorim