Problems with Zebra index

Oliver Goldschmidt

23 Jul 2013

9:29 a.m.

Hi Koha community,

I am new to Koha and have spent the last week trying to feed the Zebra index with our bibliographic records. This turned out to be pretty difficult. I have successfully imported our records (about 600,000) into the Koha database. Then I tried to use rebuild_zebra.pl to put the records into the index. This failed for lack of disk space: I have 100 GB of disk space reserved for the Zebra index (mounted on /var/lib/koha) and have split this space in the Zebra config into 45 GB for the shadow directory and 45 GB for the register directory. This was not sufficient, which I find a little bit weird, because I think 600,000 records should not take so much space... So, my first question: is that normal? Does Zebra need so much disk space for the index? What exactly are the register and shadow directories for?

My next try was indexing with rebuild_zebra_sliced.sh. I used the default value of 10000 for the chunks. First I got an error, I guess because a configuration value was not set properly (the script did not find index_mode, so I set it manually to "dom", which I guessed should be the correct value for indexing marcxml). After fixing that manually, I succeeded in splitting my export file into 59 chunks of 10000 records each. I tried to index the first two chunks, and that seemed to work without problems for the first chunk (but it finished very fast, which made me wonder whether Koha really did anything; I have just realized that the marcxml file was not valid, but why didn't I get an error?). For the second chunk, there were two messages (unfortunately I cannot recall them). This is the command I used:

zebraidx -c /etc/koha/sites/koha/zebra-biblios.cfg -v none,fatal,warn -g marcxml -d biblios update /tmp/rebuild/export/biblio/exported_records_1000001

But now, when I search the Koha OPAC for an "e", for example, I still get no results. The index seems to be empty, yet there are files in /var/lib/koha/koha/biblio/shadow. Is there a way to look into the Zebra index directly? I have no idea where to look next.

Does anybody have any hint about that? Any help would be appreciated.

Best
-Oliver

-- Oliver Goldschmidt TU Hamburg-Harburg / Universitätsbibliothek / Digitale Dienste Denickestr. 22 21071 Hamburg - Harburg Tel. +49 (0)40 / 428 78 - 32 91 eMail o.goldschmidt@tuhh.de -- GPG/PGP key: http://www.tub.tu-harburg.de/keys/Oliver_Marahrens_pub.asc

Oliver Goldschmidt

23 Jul 2013

11:12 a.m.

I retried indexing the first chunk after I corrected the XML. Here are the warning messages I got from zebraidx:

10:45:10-23/07 zebraidx(14513) [warn] Couldn't open collection.abs [No such file or directory]
11:11:49-23/07 zebraidx(14513) [warn] Record didn't contain match fields in (bib1,Local-number)

The first one appeared when I started indexing, the second when indexing ended. I still do not find anything when I search the Koha OPAC for an "e".

Best - Oliver

Jared Camins-Esakov

12:38 p.m.

Oliver,

I have good news and bad news. The good news is, the fix to your problem is probably easy. The bad news is that running the zebraidx command manually more than likely messed up your installation.

It sounds like your first problem can be solved simply by increasing the space that Zebra will use (it is not uncommon to need in excess of 100GB for indexes in a large installation). I'm not sure how you increased the space allotted, so I'm going to provide instructions for the correct way to do this that you can check your work against. If you open up the zebra-biblios.cfg and zebra-biblios-dom.cfg files that Koha installed (in /etc/koha/sites/koha/), you'll need to change two lines, the lines starting with register and shadow. At the end of each line it says 20G or 45G, depending on whether you changed that. Change those numbers to, say, 80G.

rebuild_zebra_sliced would not help you in this instance, because your problem is the amount of disk space required, not a bad record.

Now for the bad news. If you ran zebraidx as any user other than koha-koha, your permissions are going to be all wrong. You can try changing the owner recursively on /var/lib/koha/koha to koha-koha. That might fix it (but I am not sure, since I haven't tried). The zebra_bib_index_mode is easy to fix, fortunately. Just change zebra_bib_index_mode to grs1, run rebuild_zebra.pl -r -b -x and you should be fine. You can worry about switching to DOM indexing once you have indexing with GRS-1 working: http://wiki.koha-community.org/wiki/Switching_to_dom_indexing

Regards, Jared

-- Jared Camins-Esakov Bibliographer, C & P Bibliography Services, LLC (phone) +1 (917) 727-3445 (e-mail) jcamins@cpbibliography.com (web) http://www.cpbibliography.com/
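The config change described above can be sketched in shell. This is only an illustration under stated assumptions: it assumes the two lines look like `register: <path>:<size>` and `shadow: <path>:<size>` (the thread only says they start with register or shadow and end in 20G or 45G), and `set_zebra_size` is a hypothetical helper, not something shipped with Koha or Zebra.

```shell
#!/bin/sh
# Sketch: set the size limit on the "register" and "shadow" lines of a
# Zebra config file. Assumes lines of the form "register: /some/path:45G".
# set_zebra_size is a hypothetical helper, not part of Koha or Zebra.
set_zebra_size() {
    cfg="$1"
    size="$2"
    # Keep a backup, then rewrite only the trailing size on matching lines.
    cp "$cfg" "$cfg.bak"
    sed -E "s/^((register|shadow):.*:)[0-9]+[A-Za-z]*[[:space:]]*\$/\\1$size/" \
        "$cfg.bak" > "$cfg"
}

# Example, using the paths and user named in this thread (run as root):
#   set_zebra_size /etc/koha/sites/koha/zebra-biblios.cfg 80G
#   set_zebra_size /etc/koha/sites/koha/zebra-biblios-dom.cfg 80G
#   chown -R koha-koha /var/lib/koha/koha
```

The helper leaves every other line untouched and writes a `.bak` copy first, so the original limits can be restored if the new values turn out to be wrong for the disk.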

Oliver Goldschmidt

1:30 p.m.

Jared,

thank you very much for your reply!

In fact I forgot to run zebraidx as user koha-koha, and so you were right: I had bad permissions on the index files in shadow. I tried to fix that by changing the ownership and restarting Zebra, but that had no effect. So I guess you are right that I messed up my index by trying that (which is not too bad; I can still remove the index and try again as koha-koha). I hope nothing else broke because of that mistake, and the database is still fine?!

To increase the disk space, that's exactly what I did: I changed the value in zebra-biblios.cfg. But can you explain what the directories are used for? After indexing finishes, will I have data in both directories, or could I configure my 100 GB disk so that both directories can take 80 GB of space? I will try that and see...

I still have a problem with rebuild_zebra.pl: it ignores the -s parameter. If I understood correctly, rebuild_zebra should use an existing exported_records file if I use the parameters -s and -d. But it doesn't. Every time I start rebuild_zebra, the script exports my database (this takes quite a long time and I wanted to bypass it). Is this a bug or am I missing something?

Best - Oliver

Jared Camins-Esakov

1:40 p.m.

Oliver,

> In fact I forgot to run zebraidx as user koha-koha, and so you were right: I had bad permissions on the index files in shadow. I tried to fix that by changing the ownership and restarted zebra, but that had no effect. So I guess you are right, that I messed up my index by trying that (which is not too bad; I can still remove the index and try again as koha-koha). I hope nothing else broke by that mistake, and the database is still fine?!

Your best bet actually is to use the koha-rebuild-zebra command, which will automatically use the correct user. But your data is definitely safe, just not searchable yet.

> To increase the disk space, that's exactly what I did: I changed the value in zebra-biblios.cfg. But can you explain what the directories are used for? After finishing indexing, will I have data in both directories or could I configure my 100 GB disk so that both directories can take 80 GB space? I will try that and see...

The register directory stores the actual index, and the shadow directory stores a working copy during the indexing process. If you do a full index with shadow enabled, you'll run out of space on the 100GB disk, but my experience is that in ordinary day-to-day work, it's unlikely you'll have a problem.

> I still have a problem with rebuild_zebra.pl: it ignores the -s parameter. If I understood that right, rebuild_zebra should use an existing exported_records file if I use the parameters -s and -d. But it doesn't. Any time I'm starting rebuild_zebra, the script exports my database (this takes pretty much time and I wanted to bypass it). Is this a bug or am I missing anything?

It's always worked for me, so I'm not sure what the problem would be here.

Regards, Jared
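To see how the space is actually split between the two directories, something like the following could be run during or after indexing. It is a rough sketch: the shadow path is the one mentioned earlier in the thread, the register path is assumed to be its sibling, and `report_usage` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report disk usage of the Zebra register and shadow directories.
# ZEBRA_DIR defaults to the parent of the shadow path quoted in this thread;
# the "register" sibling directory is an assumption.
ZEBRA_DIR="${ZEBRA_DIR:-/var/lib/koha/koha/biblio}"

# report_usage is a hypothetical helper, not part of Koha or Zebra.
report_usage() {
    for sub in register shadow; do
        if [ -d "$1/$sub" ]; then
            du -sh "$1/$sub"          # human-readable total for the directory
        else
            echo "missing: $1/$sub"
        fi
    done
}

report_usage "$ZEBRA_DIR"
```

Comparing the two totals against the limits set in the Zebra config shows which of the two directories is the one hitting its limit.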

Robin Sheat

24 Jul 2013

1:07 p.m.

On 23/07/13 13:40, Jared Camins-Esakov wrote:

> If you do a full index with shadow enabled, you'll run out of space on the 100GB disk, but my experience is that in ordinary day-to-day work, it's unlikely you'll have a problem.

It's also worth noting that /tmp (or whatever is defined as your temporary directory) is used to hold the extracted records, so with a large collection, that may become an issue.

-- Robin Sheat Catalyst IT Ltd. ✆ +64 4 803 2204 GPG: 5957 6D23 8B16 EFAB FEF8 7175 14D3 6485 A99C EB6D
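A quick pre-flight check of the temporary directory might look like this. It is a sketch under assumptions: it treats `TMPDIR` (falling back to /tmp) as the temporary directory, uses the roughly 8 GB export size reported in this thread as the threshold, and `free_kb` and `has_space` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: check that the temporary directory has room for the export.
# free_kb and has_space are hypothetical helpers, not part of Koha.

# Print the free space, in kilobytes, of the filesystem holding "$1".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if "$1" has at least "$2" kilobytes free.
has_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

tmpdir="${TMPDIR:-/tmp}"           # assumption: TMPDIR names the temp directory
needed_kb=$((8 * 1024 * 1024))     # about 8 GB, the export size reported in this thread

if has_space "$tmpdir" "$needed_kb"; then
    echo "enough room in $tmpdir"
else
    echo "less than ${needed_kb} KB free in $tmpdir"
fi
```

The threshold is specific to this collection; a larger catalogue would need a larger value.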

Oliver Goldschmidt

1:19 p.m.

Jared: thanks again for your hints. It worked this time: we have a running Zebra and all records have been imported successfully. There is still the mystery of the ignored -s parameter, but everything worked fine. The index ended up needing about 60 GB on disk.

I saw that the exported_records file (which in our case was about 8 GB once the export had finished) is held in /tmp, and I had reserved plenty of space there, so that was not a problem at all. But thank you for the note, Robin!
