[Koha] Koha performance and 6 million records

Ere Maijala ere.maijala at helsinki.fi
Mon Mar 4 21:16:14 NZDT 2019


Hi,

In my experience, a setup of this size is not worth attempting with
Zebra, but it should be no problem with Elasticsearch. However, when it
comes to questions about performance, most answers are guesswork apart
from "try it and see how it goes".

For what it's worth, here are my thoughts:

- Use Elasticsearch for a large index. It's faster and uses a lot less
disk space. It can also be moved to a separate host if necessary.
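
The switch itself is roughly: set the SearchEngine system preference
to Elasticsearch and then do a full rebuild of the index. A sketch
(the script name and location differ between Koha versions and between
git and package installs, and "mylibrary" is just a placeholder for
your instance name):

    sudo koha-shell mylibrary
    perl misc/search_tools/rebuild_elasticsearch.pl -d -b -a -v

Here -d recreates the index from scratch, -b/-a index biblios and
authorities, and -v prints progress.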

- A faster disk always helps, and even if performance were otherwise
adequate, fast disks make many tasks, such as a full reindex, more
comfortable to run.

- With Elasticsearch (and to some extent MariaDB), you need to make
sure enough memory is allocated to them. With 6 million records you may
need to increase the heap reserved for ES, but it's more important to
have enough free memory on the server so that most or all of the index
can be cached in memory.
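
To give a concrete idea, the relevant knobs look something like this
(the sizes are illustrative starting points, not recommendations):

    # /etc/elasticsearch/jvm.options -- ES heap; keep it well below
    # half of the total RAM so the rest is left for the OS page cache
    -Xms4g
    -Xmx4g

    # MariaDB, e.g. /etc/mysql/mariadb.conf.d/50-server.cnf
    [mysqld]
    innodb_buffer_pool_size = 2G

Whatever the JVM and MariaDB don't claim stays available for caching
the index files.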

- Having a separate host for e.g. Elasticsearch, and perhaps also for
MariaDB, makes it easier to manage their resources and to make sure
that one of them doesn't starve the others.
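
Moving Elasticsearch to another machine is mostly a matter of pointing
koha-conf.xml at it; a sketch, with the hostname and index name as
placeholders:

    <elasticsearch>
      <server>es1.example.org:9200</server>
      <index_name>koha_mylibrary</index_name>
    </elasticsearch>

MariaDB can be moved the same way by changing the database hostname in
koha-conf.xml.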

- A separate discovery service allows you to split the search load
between staff use and patron use, but you may end up needing double the
resources to run the complete system. Obviously a separate system
allows other flexibility too, but it comes with a more complex setup.

- Note that while Elasticsearch now works pretty well, there are still
some open issues, and every new Koha release brings improvements. I'd
run tests with the latest master version if possible. You may also be
interested in enhancements still in the pipeline, such as parallelizing
the indexing process (see the sketch after the link):
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=21872
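
With the patches from that bug applied, the rebuild can be spread over
several processes. Going by the bug report (so verify the option name
against the actual patch), usage would be along the lines of:

    perl misc/search_tools/rebuild_elasticsearch.pl -d -b -a --processes 4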

As you can see from the above points, I'd consider the search index the
critical part for good performance. Obviously it's also necessary to
make sure MariaDB, Plack etc. get enough resources, but I'd still say
that the search index is the one that makes or breaks it.
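
For completeness: with the Debian packages, the Plack worker count can
be tuned in /etc/koha/koha-sites.conf (the values below are
illustrative, and "mylibrary" is again a placeholder):

    PLACK_WORKERS=4
    PLACK_MAX_REQUESTS=50

followed by "sudo koha-plack --restart mylibrary".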

Regards,
Ere

Michael Kuhn wrote on 3.3.2019 at 13.15:
> Hi
> 
> We are currently running Koha 18.11 on Debian GNU/Linux 9 (virtual
> machine with 2 processors, 4 GB RAM) and 50'000 bibliographic records in
> a MariaDB database, using Zebra, and of course Plack and Memcached.
> There are only about 160 users. We would like to add 6 million
> bibliographic records (metadata of articles) that before were only
> indexed in the Solr index of a proprietary discovery system, but not in
> Koha itself.
> 
> Since we lack experience with such large masses of data, we would
> like to investigate the consequences:
> 
> * Will the overall performance and especially the retrieval experience
> suffer a lot? (note there are only 160 users) We're afraid it will...
> 
> * If yes, is there a way to improve the retrieval experience? For
> example, changing from a virtual machine to a dedicated or physical
> host? Adding more or faster processors, or more RAM? Using SSD disks
> instead of SAS? Changing from Zebra to Elasticsearch? Or will we need
> to implement another discovery system, like VuFind (which uses Solr)?
> 
> Any tips or hints are very much appreciated!
> 
> Best wishes: Michael

-- 
Ere Maijala
Kansalliskirjasto / The National Library of Finland

