Re: [Koha] Zebra not updating biblios automatically in koha 3.8
At 08:37 PM 8/31/2012 +0100, Elaine Bradtke wrote:
My colleagues are reporting a similar problem with 3.8.4. Downloading and editing records seems to function normally, but once a record is saved, none of the usual searches brings it up, and running a report on recently catalogued items gives zero results.
mysql Ver 14.14 Distrib 5.1.63, for debian-linux-gnu (i486) using readline 6.1
Apache/2.2.16 (Debian)
Zebra version: Zebra 2.0.50
Any ideas?
I have found that 3.8.x may have problems with env vars not being available to cron. Please try editing:

    sudo vi /home/koha/.bashrc

Add the following two lines:

    export PERL5LIB=/usr/share/koha/lib
    export KOHA_CONF=/etc/koha/koha-conf.xml

Save and quit with :x, then reboot your server. This in itself cannot do any harm, and could well cure your problem.

Best - Paul
Elaine Bradtke
On Fri, Aug 31, 2012 at 12:05 PM, sunil sharma <koha.sunil007@gmail.com> wrote:
Dear All,
I installed Koha 3.8.02 and also checked the latest version, Koha 3.8.04, on CentOS 6.2. I am facing the same problem in both versions: when I enter a new biblio, it is not updated automatically in Zebra. I am using the latest Zebra version, 2.0.54, and all cron jobs are set. I also checked Zebra manually with

    ./rebuild_zebra.pl -b -a -v -z

but it reports zero biblios exported, even though I have already added some biblios to Koha. When I use

    ./rebuild_zebra.pl -b -a -v -r

it rebuilds all the biblios, and all my added biblios show in the result, i.e. 10 biblios exported. My question is: why is the -z option of ./rebuild_zebra.pl -b -a -v -z not working? Where is the problem? Zebra worked fine for me in 3.6. Is there a problem in Koha 3.8, or is it some other issue? Please help me out.

Thanks in advance.
Sunil
_______________________________________________
Koha mailing list http://koha-community.org
Koha@lists.katipo.co.nz
http://lists.katipo.co.nz/mailman/listinfo/koha
--
Elaine Bradtke
Data Wrangler
VWML
English Folk Dance and Song Society | http://www.efdss.org
Cecil Sharp House, 2 Regent's Park Road, London NW1 7AY
Tel +44 (0) 20 7485 2206 (This number is for the English Folk Dance and Song Society in London, England. If you wish to phone me personally, send an e-mail first. I work off site)
--------------------------------------------------------------------------
Registered Company No. 297142
Charity Registered in England and Wales No. 305999
---------------------------------------------------------------------------
"Writing about music is like dancing about architecture" --Elvis Costello (Musician magazine No. 60 (October 1983), p. 52)
--- Maritime heritage and history, preservation and conservation, research and education through the written word and the arts. <http://NavalMarineArchive.com> and <http://UltraMarine.ca>
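One way to test the cron theory directly is to have a small script report what it can see. The following is a minimal sketch, not part of Koha: the function name missing_koha_env is hypothetical, and only the two variable names come from Paul's message.

```python
import os

# Variable names taken from the message above; everything else is illustrative.
REQUIRED_VARS = ("PERL5LIB", "KOHA_CONF")


def missing_koha_env(env=None):
    """Return the required variable names that are unset or empty in env."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_koha_env()
    if missing:
        print("Not visible in this environment: " + ", ".join(missing))
    else:
        print("PERL5LIB and KOHA_CONF are both set")
```

Running it once from a login shell and once from a temporary cron entry would show whether the two environments actually differ.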
So environment variables are not the issue. We are carefully managing those.

I have tried using the new tool checkNonIndexedBiblios.pl (from patch 6566) and it indeed finds a few recent biblios that are not indexed. Using the -z option to mark them for indexing, followed by a manual run of rebuild_zebra -b -v -z, did not get the biblios indexed. I cranked up the debugging on zebraidx (by modifying rebuild_zebra.pl and using -v -v) and did not see any obvious errors in the output that would suggest why indexing was failing.

We have applied the patch from bug 8563 ( http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=8653), which was definitely a problem for us and fixed our authorities problem. Perhaps there is a related issue here?

-Doug
Doug,

So environment variables are not the issue. We are carefully managing those.
Make sure when you are using cron jobs that you set the environment variables IN YOUR CRONTAB. Setting environment variables elsewhere is a recipe for confusion and misery down the road. However, this is -- as you say -- not the problem.
I have tried using the new tool checkNonIndexedBiblios.pl (from patch 6566) and it indeed finds a few recent biblios that are not indexed. Using the -z option to mark them for indexing followed by a manual run of rebuild_zebra -b -v -z did not get the biblios indexed. I cranked up the debugging on zebraidx (by modifying rebuild_zebra.pl and using -v -v) and did not see any obvious errors in the output that would suggest why indexing was failing.
Did you change your bibliographic frameworks? It could be a matter of the biblionumber not being stored properly. The other thing to do is to confirm that the non-indexed biblios are *actually* getting added to the zebraqueue by the 6566 script. It's kind of a long shot, but it could be an issue with the zebraqueue table getting corrupted. I've seen this happen when the zebraqueue table got too large and disk space was low.

Regards,
Jared

--
Jared Camins-Esakov
Bibliographer, C & P Bibliography Services, LLC
(phone) +1 (917) 727-3445
(e-mail) jcamins@cpbibliography.com
(web) http://www.cpbibliography.com/
So I think this is working as expected. Disk space is ample on the system in question, and the catalogue is small by most standards (about 2500 biblios). I ran rebuild_zebra.pl with the -k flag so that it left the exported records in place, and here's the tree I got.

    library:/tmp# ls -altR p6tjtKrrK3/
    p6tjtKrrK3/:
    total 0
    drwxrwxrwt 6 root root 1040 Sep 1 17:50 ..
    drwx------ 5 koha koha 100 Sep 1 06:36 .
    drwxr-xr-x 2 koha koha 60 Sep 1 06:36 upd_biblio
    drwxr-xr-x 2 koha koha 60 Sep 1 06:36 del_biblio
    drwxr-xr-x 2 koha koha 40 Sep 1 06:36 biblio

    p6tjtKrrK3/upd_biblio:
    total 16
    -rw-r--r-- 1 koha koha 12670 Sep 1 06:36 exported_records
    drwxr-xr-x 2 koha koha 60 Sep 1 06:36 .
    drwx------ 5 koha koha 100 Sep 1 06:36 ..

    p6tjtKrrK3/del_biblio:
    total 0
    drwx------ 5 koha koha 100 Sep 1 06:36 ..
    drwxr-xr-x 2 koha koha 60 Sep 1 06:36 .
    -rw-r--r-- 1 koha koha 0 Sep 1 06:36 exported_records

    p6tjtKrrK3/biblio:
    total 0
    drwx------ 5 koha koha 100 Sep 1 06:36 ..
    drwxr-xr-x 2 koha koha 40 Sep 1 06:36 .

Using marcprint.py, a small Python program built around the pymarc package, I decoded this file and found 13 MARC records, as expected. Example:

    =LDR 00871nam a22002417a 4500
    =001 201112071555.ls
    =003 UkLoVW
    =005 20111209110116.0
    =008 111207t1982\\\\enkg\\\\r\\\\\001\0\eng\d
    =040 \\$aUkLoVW$cUkLoVW
    =099 \\$aQS 40
    =100 1\$aSheffield, Ken$92330
    =245 \0$aTen country dances :$bmainly from Thompson, Wright & Wilson.
    =260 \\$aOxford :$b[The Author],$c1982.
    =300 \\$a12 p. :$bmusic ;$c30 cm.
    =490 1\$aFrom two barns ;$vv. 1
    =650 \\$9117$aCountry dances
    =650 \\$9127$aDance music
    =830 \5$aFrom two barns$92331
    =942 \\$2VWML$cBK$hQS 40$n0$6QS_00040
    =999 \\$c14879$d14879
    =952 \\$w2011-12-07$p10914$r2011-12-07$40$00$6QS_00040$915083$bVWML$10$oQS 40$d2011-12-07$70$cBOX$2VWML$yBK$aVWML
    =952 \\$w2011-12-07$p11121$r2011-12-07$40$00$6QS_00040$915084$bVWML$10$oQS 40$d2011-12-07$71$cBOX$2VWML$yBK$aVWML

I have attached an ASCII printout of all 13 records in case someone wants to look for a pattern in these records.
The problem is either in the format/contents of those records, or in zebraidx/zebrasrv or their config files. My suspicion is with the latter, since we have already had to fix one problem there for bug 6566.

-Doug-
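For anyone who wants to count the records in such an export without pymarc, here is a rough sketch. It assumes that exported_records is binary MARC (ISO 2709), which the successful pymarc decode suggests but which is not confirmed here. In that format the first five bytes of each record's 24-byte leader hold the total record length as ASCII digits. The function name split_iso2709 is hypothetical.

```python
def split_iso2709(data):
    """Split a bytes object of concatenated ISO 2709 records into records.

    Each record begins with a 24-byte leader whose first five bytes hold the
    total record length as ASCII digits.
    """
    records = []
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 5])
        if length < 24 or pos + length > len(data):
            raise ValueError("bad record length at offset %d" % pos)
        records.append(data[pos:pos + length])
        pos += length
    return records


if __name__ == "__main__":
    import sys

    # Usage (path is whatever directory rebuild_zebra.pl -k left behind):
    #   python count_records.py <path to exported_records>
    if len(sys.argv) == 2:
        with open(sys.argv[1], "rb") as handle:
            print(len(split_iso2709(handle.read())), "records")
```

If the count matches the number of biblios expected, the export step is probably fine and attention can move to the indexing step.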
Hi.

The 3.8 upgrade offers DOM indexing by default, and if you have taken that option (as seen in $KOHA_CONF), the XSL used instead of record.abs (~/koha-dev/etc/zebradb/marc_defs/marc21/biblios/biblio-zebra-indexdefs.xsl) uses a construct (z:id) for the 001, which uses that field (if it exists) as the Zebra unique id. This means that if you have more than one bib record with the same 001 (as you get if you duplicate a bib, for instance), it will only index the last one, and it won't complain at all about it.

Not sure if it's a hangover from using the XML used by authorities, which stores the auth_id in the 001, or from UNIMARC, which might use 001 as the bib number. Either way, I bet if you remove the 001 or make it unique then it will index OK. The better solution is probably to fix the XSL so that it does not use the z:id for biblios, or maybe to get it to use the 999$c, but the Zebra config scares me.

It took ages to find the cause, so I hope this helps someone.

Ian
-- Ian Bays Director of Projects, PTFS Europe Limited Content Management and Library Solutions +44 (0) 800 756 6803 (phone) +44 (0) 7774 995297 (mobile) +44 (0) 800 756 6384 (fax) skype: ian.bays email: ian.bays@ptfs-europe.com
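The effect Ian describes can be illustrated with a toy model. This is not Zebra's code: a plain dictionary stands in for an index keyed on the 001, to show how a later record with the same key silently replaces an earlier one. The biblionumbers, titles and the second control number are invented for illustration; the first control number is the one from Doug's example record.

```python
def index_by_control_number(records):
    """Build a toy index keyed on the 001 value; later duplicates overwrite."""
    index = {}
    for record in records:
        index[record["001"]] = record
    return index


# Hypothetical records: only the first 001 value appears in the thread.
records = [
    {"biblionumber": 1, "001": "201112071555.ls", "title": "Original"},
    {"biblionumber": 2, "001": "201112071555.ls", "title": "Duplicated copy"},
    {"biblionumber": 3, "001": "201112071600.ls", "title": "Unrelated"},
]

index = index_by_control_number(records)
print(len(records), "records in,", len(index), "entries indexed")
print(index["201112071555.ls"]["title"])
```

Nothing in the loop raises an error when a key repeats, which matches the "won't complain at all" behaviour reported above.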
So, doing some further research, it definitely looks like we have duplicate control numbers (001). This is a data entry mistake; it looks like the cataloger copied the biblios for similar entries. I have gone back and altered the control numbers to be unique, but rebuild_zebra.pl -b -r is not adding the new entries. Any idea what else we might need to do?

-Doug-
Actually, rebuild_zebra.pl -b -r -v does the trick. I had broken my rebuild_zebra with some debugging logic while tracking down this problem; zebraidx was not being called. We look good now on this issue.

Solution:
1. Use 'checkNonIndexedBiblios.pl -c' to get a report of missing biblios.
2. Visit http://mylibrary.org/cgi-bin/koha/catalogue/detail.pl?biblionumber=14769 (a missing biblio), choose Edit, and fix the control number to be unique (repeat for each biblio).
3. Run rebuild_zebra.pl -b -r -v

It would be good to build a tool to find duplicate control numbers. I did this by exporting all the biblios, running

    marcprint (my python utility) | grep "=001" | sort | uniq -c | sort -r | less

and looking for counts greater than 1. A better approach would be to make the control number a unique field and enforce this in the database. Is this possible?

-Doug-
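The pipeline Doug describes could also be sketched as a small script. This assumes the marcprint output has one field per line, with control numbers on lines beginning "=001" as in the example record earlier in the thread. The function name duplicate_control_numbers is hypothetical. Unlike grep, it only matches "=001" at the start of a line.

```python
import sys
from collections import Counter


def duplicate_control_numbers(lines):
    """Return {001 value: count} for values seen more than once."""
    counts = Counter()
    for line in lines:
        if line.startswith("=001"):
            counts[line[4:].strip()] += 1
    return {value: n for value, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Reads marcprint-style text on standard input, one field per line.
    for value, n in sorted(duplicate_control_numbers(sys.stdin).items()):
        print(n, value)
```

It would be used at the end of the same export, for example by piping the marcprint output into the script instead of into grep.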
Greetings, I thought I'd interject a bit.
It would be good to build a tool to find duplicate control numbers. I did this by exporting all the biblios, using:

    marcprint (my python utility) | grep "=001" | sort | uniq -c | sort -r | less

and looked for counts greater than 1.
I'll use your "marcprint" in my example, though I suspect exporting a MARC file would be useful enough, if the MARC file is then converted to something "human readable". Under Windows, I would likely use MarcEdit to "break" the .mrc file into a .mrk file.
http://people.oregonstate.edu/~reeset/marcedit/html/
And then, having uploaded my .mrk files into a Linux environment, substitute "marcprint" with "cat mymarcfile.mrk".

All this uploading got me thinking that perhaps something like:

    SELECT ExtractValue(marcxml,'//controlfield[@tag="001"]') AS Control FROM biblioitems

But I didn't take it further than this, since I don't have time.

NOTATION: # is a comment. $ is a command line prompt

    # Get a list of unique "=001" fields, should be one per record, right?
    $ marcprint | grep "=001" | sort -u -r > ~/check1.txt
    # Get a list of all the "=001" fields.
    $ marcprint | grep "=001" | sort -r > ~/check2.txt
    # Compare the two. Any differences will be due to duplications.
    $ diff ~/check1.txt ~/check2.txt

GPML,
Mark Tompsett
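The SQL idea can be tried outside the database as well. In MARCXML the 001 is carried in a controlfield element rather than a datafield, so the sketch below looks for that element. It assumes each marcxml value holds a single MARCXML record. The function name control_number and the sample record are illustrative only.

```python
import xml.etree.ElementTree as ET


def control_number(marcxml):
    """Return the text of the first 001 controlfield, or None if absent."""
    root = ET.fromstring(marcxml)
    for element in root.iter():
        # Compare local names so the sketch works with or without a namespace.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "controlfield" and element.get("tag") == "001":
            return element.text
    return None


# Illustrative record built from the field values shown earlier in the thread.
sample = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<leader>00871nam a22002417a 4500</leader>'
    '<controlfield tag="001">201112071555.ls</controlfield>'
    '<controlfield tag="003">UkLoVW</controlfield>'
    '</record>'
)
print(control_number(sample))
```

Collecting the returned values for every row and counting repeats would give the same answer as the diff of check1.txt and check2.txt.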
Better yet, we could use 'uniq -d'. It only prints out the duplicate lines, so the following should suffice: marcprint | grep "=001" | sort | uniq -d I agree we should also be able to do something with SQL or SQL/Perl pretty easily, but this was quick. -Doug On 1 September 2012 17:26, Mark Tompsett <mtompset@hotmail.com> wrote:
Greetings,
I thought I'd interject a bit.
It would be good to build a tool to find duplicate control numbers.
I did this by exporting all the biblios, using: marcprint (my python utility) | grep "=001" | sort | uniq -c | sort -r | less and looked for counts greater than 1.
I'll use your "marcprint" in my example, though I suspect exporting a MARC file would be useful enough, if the MARC file is then converted to something "human readable". Under Windows, I would likely used MarcEdit to "break" the .mrc file into a .mrk file. http://people.oregonstate.edu/**~reeset/marcedit/html/<http://people.oregonstate.edu/~reeset/marcedit/html/> And then having uploaded my .mrk files into a linux environment, substitute "marcprint" with "cat mymarcfile.mrk". All this uploading got me thinking that perhaps something like: SELECT ExtractValue(marcxml,'//**datafield[@tag="001"]/*') AS Control FROM biblioitems But I didn't take it further than this, since I don't have time.
NOTATION: # is a comment. $ is a command line prompt
# Get a list of unique "=001" fields, should be one per record, right? $ marcprint | grep "=001" | sort -u -r > ~/check1.txt # Get a list of all the "=001" fields. $ marcprint | grep "=001" | sort -r > ~/check2.txt # Compare the two. Any differences will be due to duplications. $ diff ~/check1.txt ~/check2.txt
GPML, Mark Tompsett
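For anyone who would rather script the check than chain grep/sort/uniq, here is a minimal Python sketch of the same idea. It assumes the input is MarcEdit-style mnemonic text where each control number sits on a line beginning with "=001"; whether marcprint emits exactly that layout is an assumption, and the function name and sample lines are made up for illustration.

```python
from collections import Counter


def duplicate_control_numbers(lines):
    """Return {control_number: count} for every 001 value seen more than once."""
    counts = Counter()
    for line in lines:
        # Assumed layout: mnemonic lines such as "=001  12345"; skip all other fields.
        if line.startswith("=001"):
            value = line[4:].strip()
            counts[value] += 1
    return {value: n for value, n in counts.items() if n > 1}


# Made-up sample data: two records share the same 001.
sample = [
    "=001  2005057987",
    "=245  10$aFirst title",
    "=001  42",
    "=001  2005057987",
]
print(duplicate_control_numbers(sample))
```

To run it against a real export, pass an open file object instead of the sample list, for example duplicate_control_numbers(open("mymarcfile.mrk")). Records with an empty 001 are counted under the empty string, so they show up too if there is more than one.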
At 11:32 PM 9/1/2012 +0100, Ian Bays wrote:
Hi. The 3.8 upgrade offers the dom indexing by default and if you have taken that option (as seen in $KOHA_CONF) the xsl used instead of record.abs (~/koha-dev/etc/zebradb/marc_defs/marc21/biblios/biblio-zebra-indexdefs.xsl) uses a construct (z:id) for the 001 which uses that (if it exists) as the zebra unique id. This means if you have more than one bib record with the same 001 (as you get if you duplicate a bib for instance) it will only index the last one and it won't complain at all about it.
If this is the case, then there is a problem (bug?) as the 001 is not a unique field by definition (this from our cataloguers who tell me it's only unique when taken in conjunction with the 003; they referred me to <http://www.loc.gov/marc/authority/ad001.html> which seems to confirm, but I would appreciate other views on this.) We certainly have many duplicates -- part of this is intentional (e.g. facsimile reprints and different titles for intrinsically identical books in UK and US editions[1]); part is "accidental" occurring after importing Z39.50 records from different libraries.
[snip] The better solution is to fix the xsl to probably not use the z:id for biblios or maybe get it to use the 999$c, but the zebra config scares me.
If my understanding is correct, the 999$c (biblio.biblionumber) is designed to be unique and I had _assumed_ that this was expressly for db indexing regardless of whether it's MySQL or Zebra. Our sandbox is tied up at the moment, but tomorrow I should be able to get some time to do a detailed comparison of 3.6 with 3.8. However if I'm not correct in my assumption, maybe someone would be kind enough to avoid my going on a wild goose chase. Thanks - Paul [1] -- I understand that this may not be everyone's choice and that other options may be available, but also that it is perfectly acceptable as long as we "sign" the 003 as CaOPIACS; other major libraries, so I'm told, do the same.
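To make the "only the last one is indexed, with no complaint" behaviour concrete, here is a toy Python model. It is not Zebra's code or Koha's indexing logic, just a dictionary standing in for an index that keeps one record per key. The records, field names, and the build_index helper are all hypothetical.

```python
def build_index(records, key_field):
    """Toy index: one slot per key, so a repeated key silently replaces the earlier record."""
    index = {}
    for record in records:
        index[record[key_field]] = record
    return index


# Hypothetical records: two bibs share an 001, but every biblionumber is distinct.
records = [
    {"biblionumber": 101, "001": "ctrl-0001", "title": "UK edition"},
    {"biblionumber": 102, "001": "ctrl-0001", "title": "US edition"},
    {"biblionumber": 103, "001": "ctrl-0002", "title": "Unrelated"},
]

by_001 = build_index(records, "001")
by_biblionumber = build_index(records, "biblionumber")

# Keyed on 001, one of the three records has vanished without any error.
print(len(by_001), len(by_biblionumber))
print(by_001["ctrl-0001"]["title"])
```

Keyed on the 001, the earlier of the two duplicates is simply overwritten; keyed on the biblionumber, all three records survive. That is the difference being discussed between using z:id on the 001 and using the 999$c.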
I would concur that 001 and 003 need to be taken together to have any chance of a unique identifier. Our library has our own unique 003 (UkLoVW). For indexing, I suspect a better key to hand to zebra would be the system control number (035 $a). Are these kept unique by koha?
While I do believe this is a problem, I am not sure how this would be the cause of our ModAuthority failures, unless I am missing something. See the other mail thread for details (3.8.4 Error message when editing authorities). It's possible these issues are related but I am not sure.
-Doug-
On 3 September 2012 09:10, Paul <paul.a@aandc.org> wrote:
At 11:32 PM 9/1/2012 +0100, Ian Bays wrote:
Hi. The 3.8 upgrade offers the dom indexing by default and if you have taken that option (as seen in $KOHA_CONF) the xsl used instead of record.abs (~/koha-dev/etc/zebradb/marc_defs/marc21/biblios/biblio-zebra-indexdefs.xsl) uses a construct (z:id) for the 001 which uses that (if it exists) as the zebra unique id. This means if you have more than one bib record with the same 001 (as you get if you duplicate a bib for instance) it will only index the last one and it won't complain at all about it.
If this is the case, then there is a problem (bug?) as the 001 is not a unique field by definition (this from our cataloguers who tell me it's only unique when taken in conjunction with the 003; they referred me to <http://www.loc.gov/marc/authority/ad001.html> which seems to confirm, but I would appreciate other views on this.)
We certainly have many duplicates -- part of this is intentional (e.g. facsimile reprints and different titles for intrinsically identical books in UK and US editions[1]); part is "accidental" occurring after importing Z39.50 records from different libraries.
[snip] The better solution is to fix the xsl to probably not use the z:id
for biblios or maybe get it to use the 999$c, but the zebra config scares me.
If my understanding is correct, the 999$c (biblio.biblionumber) is designed to be unique and I had _assumed_ that this was expressly for db indexing regardless of whether it's MySQL or Zebra.
Our sandbox is tied up at the moment, but tomorrow I should be able to get some time to do a detailed comparison of 3.6 with 3.8. However if I'm not correct in my assumption, maybe someone would be kind enough to avoid my going on a wild goose chase.
Thanks - Paul [1] -- I understand that this may not be everyone's choice and that other options may be available, but also that it is perfectly acceptable as long as we "sign" the 003 as CaOPIACS; other major libraries, so I'm told, do the same.
Doug, I would concur that 001 and 003 need to be taken together to have any
chance of a unique identifier. Our library has our own unique 003 (UkLoVW). For indexing, I suspect a better key to hand to zebra would be the system control number (035 $a). Are these kept unique by koha?
We already have a unique identifier, the biblionumber. It's stored in 999$c. The bug in this case is that the DOM indexing uses the contents in 001 if you have it populated.
While I do believe this is a problem, I am not sure how this would be the cause of our ModAuthority failures, unless I am missing something. See the other mail thread for details (3.8.4 Error message when editing authorities). It's possible these issues are related but I am not sure.
The ZOOM error you reported might have something to do with this, but I just don't know. I don't think I've ever seen problems with ZOOM element (as opposed to attribute) before. Regards, Jared -- Jared Camins-Esakov Bibliographer, C & P Bibliography Services, LLC (phone) +1 (917) 727-3445 (e-mail) jcamins@cpbibliography.com (web) http://www.cpbibliography.com/
Biblio number is only relevant for bibs. What should be used for authorities? That is what I was referring to below but the problem may be common to both objects. -Doug- On Sep 3, 2012 9:49 AM, "Jared Camins-Esakov" <jcamins@cpbibliography.com> wrote:
Doug,
I would concur that 001 and 003 need to be taken together to have any
chance of a unique identifier. Our library has our own unique 003 (UkLoVW). For indexing, I suspect a better key to hand to zebra would be the system control number (035 $a). Are these kept unique by koha?
We already have a unique identifier, the biblionumber. It's stored in 999$c. The bug in this case is that the DOM indexing uses the contents in 001 if you have it populated.
While I do believe this is a problem, I am not sure how this would be the cause of our ModAuthority failures, unless I am missing something. See the other mail thread for details (3.8.4 Error message when editing authorities). It's possible these issues are related but I am not sure.
The ZOOM error you reported might have something to do with this, but I just don't know. I don't think I've ever seen problems with ZOOM element (as opposed to attribute) before.
Regards, Jared
Doug, Biblio number is only relevant for bibs. What should be used for
authorities? That is what I was referring to below but the problem may be common to both objects.
Authid is already stored in 001 in authority records. The problem that Ian reports is relevant specifically and only to bib records.
Regards, Jared
Hi to all,
I would concur that 001 and 003 need to be taken together to have any
chance of a unique identifier. Our library has our own unique 003 (UkLoVW). For indexing, I suspect a better key to hand to zebra would be the system control number (035 $a). Are these kept unique by koha?
In fact, reading the bib MARC21 standard,
http://www.loc.gov/marc/bibliographic/bd001.html
my understanding is that the value in 001 must be unique within your instance of Koha. If you have an official LoC code for your library, then 001 + 003 gives you a code that is unique at world level. In fact the value of 003 is always the same for all bib records of your Koha instance.
The definition of 001 in Unimarc is clearer: "This field contains characters uniquely associated with the record, i.e. the control number for the record of the agency preparing the record." There is no equivalent of 003 in Unimarc.
In practice, during input in Koha it is possible to create a duplicate or an empty value in 001.
Bye
Zeno Tajoli
At 10:08 PM 9/3/2012 +0200, tajoli@cilea.it wrote: [snip]
In fact the value of 003 is always the same for all bib records of your Koha instance.
My understanding is that it should not be _your_ library unless _you_ created the 001 control number. As I said earlier, there are other choices and options, but if you retain e.g. the British Library control number in 001, you retain "Uk" in 003.

Which is why I'm a tad confused by Jared's comment that "Authid is already stored in 001 in authority records." Maybe this means "Koha uses 001 as Authid"??? We have hundreds of imported "authorities" in our production 3.6.1 and all authority searches seem to be fine (mind you we do check regularly for duplicate authorities -- this is one point that I do not have a good handle on. If our cataloguers "forget" to see if an authority already exists and "choose" it if it does, we systematically end up with duplicate auths.)

Here's an example from our catalogue using authorities/detail.pl for a Library of Congress Z39.50 transfer -- a single author with 11 biblios and 19 items:

000 - LEADER @ 00394nz a2200133o 4500
001 - CONTROL NUMBER @ 2005057987
003 - CONTROL NUMBER IDENTIFIER @ DLC
005 - DATE AND TIME OF LATEST TRANSACTION @ 20110813090406.0
008 - FIXED-LENGTH DATA ELEMENTS @ 110310|||a|||||| | ||| d
040 ## - CATALOGING SOURCE a Original cataloging DLC c Transcribing agency OPIACS

Best - Paul
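As a small illustration of treating 003 and 001 as a pair, here is a hedged Python sketch that pulls both out of one MARCXML record. It assumes the record stores them as controlfield elements, with or without the MARC21 slim namespace; the control_key name is hypothetical, and the sample simply reuses the 001/003 values from the record shown above.

```python
import xml.etree.ElementTree as ET


def control_key(marcxml):
    """Return (003, 001) from one MARCXML record; a missing field comes back as ''."""
    values = {"001": "", "003": ""}
    for element in ET.fromstring(marcxml).iter():
        # Drop any "{namespace}" prefix so the check works with or without xmlns.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "controlfield" and element.get("tag") in values:
            values[element.get("tag")] = (element.text or "").strip()
    return values["003"], values["001"]


# Sample built from the 001 and 003 values quoted in the message above.
sample = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">2005057987</controlfield>
  <controlfield tag="003">DLC</controlfield>
</record>"""
print(control_key(sample))
```

Comparing these pairs rather than the bare 001 would distinguish two imported records that happen to share a control number but come from different agencies. It does nothing for records where both fields are empty or identical.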
On 2012-09-4, at 7:34 AM, Doug Kingston wrote:
Biblio number is only relevant for bibs. What should be used for authorities?
FYI Doug, all authority records have a unique auth_number value (that's used in the same way as a bib's biblionumber value).
That is what I was referring to below but the problem may be common to both objects.
-Doug-
On Sat, Sep 1, 2012 at 5:46 PM, Jared Camins-Esakov <jcamins@cpbibliography.com> wrote:
Did you change your bibliographic frameworks? It could be a matter of the biblionumber not being stored properly.
Jared, no changes to the frameworks. Some of these records were created quite some time ago. I couldn't swear to it, but I'm pretty sure I remember seeing them in the catalog before this latest update, though I'm not the person who usually checks the catalogers' work. Weird coincidence or perhaps a clue? Nearly all the records that aren't indexing were created by the same cataloger. Perhaps it is something this person did consistently. But what? Elaine
participants (8)
- Doug Kingston
- Elaine Bradtke
- Ian Bays
- Jared Camins-Esakov
- Mark Tompsett
- Mason James
- Paul
- tajoli@cilea.it