Koha performance and 6 million records
Hi

We are currently running Koha 18.11 on Debian GNU/Linux 9 (a virtual machine with 2 processors and 4 GB RAM) with 50'000 bibliographic records in a MariaDB database, using Zebra, and of course Plack and Memcached. There are only about 160 users. We would like to add 6 million bibliographic records (metadata of articles) that were previously indexed only in the Solr index of a proprietary discovery system, but not in Koha itself.

Since we lack experience with such large amounts of data, we would like to investigate the consequences:

* Will the overall performance, and especially the retrieval experience, suffer a lot? (Note there are only 160 users.) We're afraid it will...

* If yes, is there a way to improve the retrieval experience? For example, changing from a virtual machine to a dedicated physical host? Adding more or faster processors, or more RAM? Using SSD disks instead of SAS? Switching from Zebra to Elasticsearch? Or will we need to implement another discovery system, such as VuFind (which uses Solr)?

Any tips or hints are very much appreciated!

Best wishes: Michael

--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch
Hi,

In my experience such a fairly large setup is not worth attempting with Zebra, but it should be no problem with Elasticsearch. However, whenever there are questions about performance, most answers would be guesswork apart from "try it and see how it goes". For what it's worth, here are my thoughts:

- Use Elasticsearch for a large index. It's faster and uses a lot less disk space. It can also be moved to a separate host if necessary.

- A faster disk always helps, and even if performance would be adequate otherwise, a fast disk makes many tasks more comfortable to execute.

- With Elasticsearch (and MariaDB to some extent), you need to make sure there's enough memory allocated for them. With 6 million records you may need to increase the heap reserved for ES, but more important is to have enough free memory on the server so that most or all of the index can be cached in memory.

- Having a separate host for e.g. Elasticsearch, but maybe also MariaDB, makes it easier to manage their resources and to make sure that one of them doesn't hog resources from the others.

- A separate discovery service allows you to split the search load between staff use and patron use, but you may end up needing double the resources to run the complete system. Obviously a separate system allows other flexibility too, but it comes with a more complex setup.

- Note that while Elasticsearch now works pretty well, there are still some open issues, and every new Koha release brings improvements. I'd run tests with the latest master version if possible. You may also be interested in enhancements still in the pipeline, such as parallelizing the indexing process: https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=21872

As you can see from the above points, I'd consider the search index the critical part for good performance. Obviously it's also necessary to make sure MariaDB, Plack etc. get enough resources, but I'd still say that the search index is the one that makes or breaks it.
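To make the heap advice above concrete: Elasticsearch's heap is set in its jvm.options file. A minimal sketch follows; the 4g figure is purely illustrative, not a sizing recommendation for this dataset, and the file path may differ depending on how ES was installed.

```
# /etc/elasticsearch/jvm.options -- illustrative values only
# Set minimum and maximum heap to the same value to avoid resize pauses.
-Xms4g
-Xmx4g
```

The usual guidance is to give the ES heap no more than about half of the machine's RAM, leaving the remainder to the operating system's page cache so the Lucene index files can be kept in memory, which is exactly the "index cached in memory" effect described above.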
Regards,
Ere

Michael Kuhn wrote on 3.3.2019 at 13:15:
-- Ere Maijala Kansalliskirjasto / The National Library of Finland
Hi Ere

Many thanks for your thoughts and technical tips!

For the moment the library has decided not to load the 6 million records into Koha (with Zebra) but to index them separately, either in VuFind or in ALBERT (a discovery system developed by the KOBV, a German union catalog), both of which, like Elasticsearch, are built on Lucene (via Solr). The main reason for the decision is the current development status of Koha's Elasticsearch support. However, this will probably be reconsidered one day in the future, when Elasticsearch is fully integrated into Koha, all or most bugs are fixed, and Zebra is definitely gone.

Best wishes: Michael

--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch
participants (2)
- Ere Maijala
- Michael Kuhn