Hi all,
We run Koha 3.02.00.004 and we have a problem with rebuild_zebra. In some circumstances, when we load and update records, duplicate index entries for the same record (same biblionumber) appear when we search the catalog. The first entry is the old index entry and the second reflects the new state of the record. It's a big catalog: more than 900,000 bib records. If we run ./rebuild_zebra.pl -b -r -x -v, the duplicate entries disappear.
I don't see anything strange in our settings. Does anybody have any idea what is happening, and whether there is a mistake in our settings or a misunderstanding on our part?
Every suggestion is appreciated.
Thanks
--
Bernard Desnoues
Librarian
Bibliothèque universitaire des langues et civilisations
IT Department
60, rue de Wattignies - 75012 Paris
Tel. (+33) 01 53 46 15 57
Fax (+33) 01 53 46 15 90
http://www.bulac.fr
bernard.desnoues@bulac.fr
Bernard,
Please see: http://koha-community.org/documentation/faq/searching/#5
And let us know if any of those suggestions help at all.
Good luck!
Liz Rea
NEKLS
(In reply to Bernard Desnoues's message of Mar 9, 2011, 7:13 AM.)
_______________________________________________
Koha mailing list
http://koha-community.org
Koha@lists.katipo.co.nz
http://lists.katipo.co.nz/mailman/listinfo/koha
Hello Liz,
I had run these command lines with no result.
Thanks
Bernard
How often do you run rebuild_zebra.pl? With such a large catalog, you may be starting an instance of the script while a previous one is still running. You could try running rebuild_zebra less frequently. If your records are 'clean', you can also run rebuild_zebra with the 'nosanitize' option, which significantly decreases its execution time.
Regards,
--
Frédéric DEMIANS
http://www.tamil.fr/u/fdemians.html
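One way to rule out the overlap scenario is to have the cron job take a lock before it starts and skip the run if a previous one still holds it. The following is only a minimal sketch in Python, assuming a Unix host. The lock file path and the helper name `run_exclusively` are hypothetical, and nothing like this is part of rebuild_zebra.pl itself.

```python
import fcntl
import subprocess

# Hypothetical lock file location; any path writable by the cron user would do.
LOCK_PATH = "/tmp/rebuild_zebra.lock"


def run_exclusively(command, lock_path=LOCK_PATH):
    """Run `command` unless another run already holds the lock.

    Returns the command's exit code, or None if the run was skipped
    because a previous instance is still holding the lock.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails immediately if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # a previous run is still in progress
        # The lock is released when lock_file is closed at the end of the block.
        return subprocess.call(command)
```

From cron, one would call something like `run_exclusively(["./rebuild_zebra.pl", "-b", "-z"])` and treat a `None` result as "skipped, previous run still active". The path to the script depends on the installation.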
2011/3/11 Frédéric Demians <frederic@tamil.fr>:
If your records are 'clean', you can also run rebuild_zebra with 'nosanitize' option. It will significantly decrease rebuild_zebra execution time.
Frédéric, Could you please give a bit more detail on this, as in define "clean" as you use it here? Thanks, Hans
rebuild_zebra.pl works in two stages: (1) it exports all or queued records to a file; (2) it hands the exported file to the Zebra indexer (the zebraidx command). The -nosanitize option modifies the first stage. Without this option, during stage 1, records are 'sanitized' before being written to the file, i.e. their leader is fixed, the biblionumber is checked, UNIMARC tag 100 is forced to UTF-8, and a few other things. This 'sanitizing' requires reading the records, parsing them into a Perl object, manipulating the object, and finally formatting it back into XML. This consumes CPU and memory, and takes time. With the -nosanitize option, records are read from MySQL and written directly to the export file. This drastically decreases the time rebuild_zebra.pl spends in stage 1.
In this sense, a 'clean' record is a record that doesn't need to be sanitized: leader OK, correct record ID, etc.
By the way, coming back to the initial question, it could also be worthwhile to improve the performance of stage 2, that is, Zebra's raw indexing performance.
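To make the notion of a 'clean' record a bit more concrete, here is a rough Python sketch that checks two of the properties mentioned above on a single MARCXML record: a 24-character leader and the expected record ID. It is not the code rebuild_zebra.pl uses. The function name `record_problems` is hypothetical, and the tag and subfield holding the biblionumber are assumptions that depend on your MARC flavour and framework mapping (for instance 999$c in many MARC21 setups), so check your own configuration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}leader'."""
    return tag.rsplit("}", 1)[-1]


def record_problems(marcxml, id_tag, id_subfield, expected_id):
    """Return a list of problems found in one MARCXML record.

    id_tag / id_subfield: where the biblionumber is stored (an assumption,
    installation-specific). expected_id: the biblionumber from the database.
    An empty list means the two checks passed.
    """
    root = ET.fromstring(marcxml)
    leader = None
    found_ids = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "leader":
            leader = elem.text or ""
        elif name == "datafield" and elem.get("tag") == id_tag:
            for sub in elem:
                if local_name(sub.tag) == "subfield" and sub.get("code") == id_subfield:
                    found_ids.append((sub.text or "").strip())

    problems = []
    # A MARC leader is always exactly 24 characters long.
    if leader is None or len(leader) != 24:
        problems.append("leader is missing or not 24 characters long")
    # Exactly one record ID, matching the database key, is expected.
    if found_ids != [str(expected_id)]:
        problems.append("record ID %r does not match %s" % (found_ids, expected_id))
    return problems
```

Running such a check over every record before switching to -nosanitize would only cover these two points; the other fixes applied during sanitizing (such as the UNIMARC tag 100 encoding) are not checked here.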
2011/3/11 Frédéric Demians <frederic@tamil.fr>:
In this perspective, a 'clean' record is a record which doesn't need to be sanitized: leader ok, correct record id, etc.
Thanks, that's very helpful. If you're willing to get into it a bit more: is there any way to run a one-time check on the database to determine whether or not it is clean, or, if it turns out to be "dirty", a way to re-import the data once it's been cleaned by the script? Alternatively, if I'm just now loading all my data from MARC records, can I assume it is clean to start with? What sort of things cause the data to get "dirty", and conversely, are there maintenance tasks that can be run to ensure it stays clean?
In other words, how can a Koha admin ensure that the much faster -nosanitize option can be used, so that re-indexing can be run more frequently? This would be very helpful, especially during a period when a lot of cataloging is being done.
Sorry for being such a pest; I realize it's a lot to ask and a bit off-topic, so feel free to just ignore this if you haven't the time to go into it at the moment 8-).
Hello Frédéric,
We run rebuild_zebra (-z) every 2 minutes, but this cron job is not active while we load records.
Thanks
Bernard
participants (4)
- Bernard Desnoues
- Frédéric Demians
- hansbkk@gmail.com
- Liz Rea