Apparently, I had not been explicitly clear in a couple of messages relating to Solr/Lucene. To be as clear as I can be briefly: I favour using Solr/Lucene for local indexing in Koha. I oppose losing support for Zebra as a Z39.50/SRU server without a replacement with a sufficiently comparable feature set; no other free software Z39.50/SRU server has a sufficiently comparable feature set at present. I also oppose losing the Z39.50 client support in Koha and other useful searching features which had been developed for Zebra. If Solr/Lucene were to replace Zebra, such a loss would affect future development but not currently implemented features.

BibLibre's prospective rewrite of searching does not affect the current Koha Z39.50 client for copy cataloguing, which does not rely upon the Koha search module, C4::Search. Only possible improvements in copy cataloguing would be affected if some Koha search code were no longer available as a model.

Anyone who has worked with Zebra knows that it can be like working with a mysterious black box, at least when updating indexes. Reports of failures from Zebra with no error message (and, I presume, conversely no success code on success) need close investigation if we are relying on Zebra for any purpose. Good error reporting is vital for good software. No software ever 'just works'; the presumption that software 'just works' comes from not examining it closely enough, or because errors fall within some tolerance range.

After determining that Solr/Lucene is now suitable for Koha, I have given my attention to what would need to be done to improve some Z39.50/SRU server option to have a feature set sufficiently comparable to Zebra but with support for Solr/Lucene.
The attention which I have given to the BibLibre suggestion of JZKit as a replacement Z39.50/SRU server may have led me to neglect other things, such as the full significance of BibLibre's use of the Data::SearchEngine Perl module for adding Solr/Lucene support to Koha.

Remainder of reply inline:

On Sat, November 20, 2010 04:49, LAURENT Henri-Damien wrote:
On 19/11/2010 22:41, Thomas Dukleth wrote:
Reply inline:
[snip]
1. QUALITIES OF SOLR/LUCENE.
Many of the capabilities which Zebra support provides are not being used in Koha, and we are comparing our own familiarity with Zebra's difficulties against a rosy ideal of what Solr/Lucene offers. The first advantage of Solr/Lucene is that it empowers less sophisticated users to control how indexing is configured via a web interface. A web interface could be created for Zebra, but that is not an existing feature. I am for everything which empowers users to more easily exercise control over their software.

As far as I know, Solr is successfully used every day in many open source OPACs, and in other projects too (Thunderbird, Alfresco, Drupal, ...). I do not claim that they are better than we are. But why should we doubt that this solution, widely used, which is a real kind of standard in indexing engines, would be a good one?
I do not doubt that Solr/Lucene provides for good indexing and searching. At the time when the projects which Henri-Damien Laurent lists started using Solr/Lucene, the Solr/Lucene feature set was not sufficiently sophisticated for Koha and was clearly less well developed than Zebra. Lucene itself would have been sufficient by 2006, but not the simplified subset of features supplied by Solr/Lucene at the time. See my koha-devel list message about the issue, http://lists.koha-community.org/pipermail/koha-devel/2010-October/034468.htm... .
a) Getting and providing support for that tool would be easier. A Solr community exists.
A large independent software community is certainly better than depending on the small library market. However, libraries have additional special needs, such as Z39.50/SRU support, which the wider market should not be expected to provide.

2. OVERSIMPLIFICATION.
b) Since it has been used in VuFind and Blacklight, I think we could share experiences more easily, and eventually build direct bridges between Koha and those solutions.
We may well be able to learn much from the experience which VuFind and Blacklight have had with Solr/Lucene. They are attractive OPACs which do not have the burden of providing a full library automation system. However, in ignoring library science principles, they have not served their users as well as they might have otherwise.

The most significant advantage of VuFind and Blacklight is a primitive implementation of faceted browsing from the result set. Their model copied the same primitive model provided by Endeca for NCSU, http://www2.lib.ncsu.edu/catalog . Facets are treated as mere text strings, often without contextual meaning. Subfields are treated in isolation with no contextual association; most usually, only subfield $a appears in facets. Authority control is not used for authority controlled fields.

In adding faceting from the result set to Koha, Joshua Ferraro followed the fashionable interest of the moment in the NCSU Endeca OPAC, which was also the easiest to implement. Endeca had previously done much better work for non-library customers. I tried to interest Joshua in using a model more like facet browsing systems such as AquaBrowser, interfaces from OVID, and others. I even added my own more flexible design to the Koha wiki. Joshua continued to advocate the fashion for the NCSU Endeca OPAC and added a claim that LibLime users would not want a more sophisticated and necessarily more complex model.

The BibLibre Solr/Lucene implementation seems to match the existing Koha implementation of facets, which is no detriment to the hard work of BibLibre. A more sophisticated faceting model can always be provided for Koha in future. When cooperating with other projects, we should be aware that other projects have had limited goals. I hope that Koha will have as much to teach other projects as to learn from them in future.
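The 'text string' facet model described above can be illustrated with a minimal sketch (hypothetical data and function names; this is not code from Koha, VuFind, or Blacklight): facet values are counted as isolated strings, with no authority control and no subfield context.

```python
from collections import Counter

def primitive_facets(results, field):
    """Count facet values as bare text strings, in the style of the
    NCSU-type model described above: each value stands alone, with
    no authority control and no subfield context."""
    counts = Counter()
    for record in results:
        for value in record.get(field, []):
            counts[value] += 1
    return counts.most_common()

# Hypothetical result set; the values stand in for isolated subfield $a strings.
results = [
    {"subject": ["Cataloging", "Libraries"]},
    {"subject": ["Libraries"]},
    {"subject": ["Cataloging"]},
]
print(primitive_facets(results, "subject"))
```

A more sophisticated model would carry subfield context and authority identifiers alongside each value rather than collapsing everything to a string.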
3. Z39.50 CLIENT SUPPORT.
There is more than just the problem of losing very good Z39.50/SRU server support which would follow from BibLibre's announced implementation of Solr/Lucene for local indexing. Z39.50 client support which was undertaken for local indexing with Zebra could enable future Z39.50/SRU client development without Zebra. In rewriting searching for their own testing branch of Koha for Solr/Lucene implementation, BibLibre have removed valuable Z39.50 client support which could be used in future features such as querying Z39.50 servers in addition to Solr/Lucene in the OPAC and presenting a unified result set using Pazpar2.
Support for Pazpar2 may be the more significant part of the potential loss in this case. Koha has no UNIMARC-specific support for Pazpar2; therefore, the possibility that people at BibLibre may under-appreciate the benefit of Pazpar2 is understandable.

3.1. Z39.50 CLIENT COPY CATALOGUING SUPPORT.
Z39.50 client copy cataloguing support is independent from C4::Search and is thus safe, but could not be improved using code from C4::Search if that code is gone.

It was done in Koha 2.2 and, as far as I know, without Zebra; this feature still exists in our testing box. It is just using a direct ZOOM search rather than C4::Search::SimpleSearch. So this valuable feature is still there. And we take care not to break existing features.
I should have put the Z39.50 client copy cataloguing sentence in a paragraph of its own; Henri-Damien seems to have misunderstood. My sentence clearly states that the Z39.50 client copy cataloguing support is safe. My concern is about the potential loss of search code which could be used to improve the copy cataloguing client in possible future development.

4. SIMPLESERVER.
The same goes for the Z39.50 server, for which we wrote a wrapper using Net::Z3950::SimpleServer. This would allow people to expose their collection as a Z39.50 server.
Net::Z3950::SimpleServer has been on my list as one option which would need improvement to have features sufficiently comparable to Zebra as a Z39.50 server. 'Simple' may be taken to mean that supporting complexity is an exercise left to the programmer using the tool. Implementations of SimpleServer generally support only use attributes, because of the complexity of managing more. Below, Henri-Damien attests to the difficulties of parsing PQF reliably. SimpleServer does not support SRU. The documentation for SimpleServer is less complete than the Zebra documentation, and we also have much greater working knowledge of Zebra. I will enquire with Index Data about what options there might be for adding SRU support to SimpleServer.

5. PREFIX QUERY FORMAT.
As I have stated previously, C4::Search ought to be rewritten in future using prefix query format (PQF) as the native language for Z39.50.
Well, rewriting the whole search with PQF would not be handy in many respects. I have thought about that many times, and using PQF still appears to me not to be the solution. a) It is really a pain to maintain and analyse. b) Whenever you need some more features in your search, you have to add some more qualifiers, and therefore provide a robust parser.
I agree that analysing PQF connectors is tricky in comparison to CCL connectors, which Yaz converts to PQF. In my own work independent of Koha, I overcame the difficulties by having the user interface display the PQF query which my code would generate. I have code for writing PQF query sets which I started in 2005 using PHP Yaz, before Net::Z3950::ZOOM was available for Perl. My code supports the complete, and I mean complete, Bib-1 attribute set. I could port the code to Perl with a little effort, although it would still need some work for more extensibility of the user controlled term sets. Queries built using PQF can be tested by building the same query in CCL and sending it to Yaz for conversion to PQF as a comparison. Writing a PQF parser to interpret incoming PQF queries for SimpleServer would be perhaps an order of magnitude more complex.
Continuing to support Common Command Language (CCL) is trivial because Yaz translates CCL to PQF as it does now for Koha.
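As a rough illustration of what writing PQF directly involves (a hypothetical sketch in Python, not Koha or Yaz code): PQF operators such as @and are prefix and binary, so multiple clauses must be folded into nested operators, and each term carries explicit Bib-1 attributes (use attribute 4 is Title and 1003 is Author in the standard Bib-1 set).

```python
def attr_clause(use_attr, term):
    """One PQF clause with a Bib-1 use attribute (attribute type 1)."""
    return f'@attr 1={use_attr} "{term}"'

def pqf_and(clauses):
    """Fold a list of clauses into nested binary @and operators,
    as the prefix notation of PQF requires."""
    query = clauses[0]
    for clause in clauses[1:]:
        query = f'@and {query} {clause}'
    return query

# Title AND author query using standard Bib-1 use attributes.
q = pqf_and([attr_clause(4, "koha"), attr_clause(1003, "dukleth")])
print(q)  # @and @attr 1=4 "koha" @attr 1=1003 "dukleth"
```

Generating such queries is manageable; as noted above, parsing arbitrary incoming PQF, with its nested prefix operators and full attribute lists, is a much harder problem.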
6. CODE ABSTRACTION.

In the context of refactoring to support all options, future rewriting of C4::Search for Zebra would mean rewriting it in C4::Search::Zebra alongside C4::Search::Solr, and C4::Search would become a neutral abstraction.

Again, we used Data::SearchEngine as a base for coding, and we use the results from querying Data::SearchEngine. If anyone is willing to write a Data::SearchEngine::Zebra, Data::SearchEngine::Nutch, or Data::SearchEngine::Pazpar2, feel free to do so. We will contribute not just to our little community but to the wider Perl community.
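A neutral abstraction of the kind described, where a generic search module dispatches to backend-specific modules such as C4::Search::Zebra or C4::Search::Solr, might look roughly like the following sketch (Python used for illustration; all class and function names are hypothetical):

```python
from abc import ABC, abstractmethod

class SearchBackend(ABC):
    """Neutral interface, analogous to the role C4::Search could play
    if backend-specific code moved into C4::Search::Zebra and
    C4::Search::Solr."""
    @abstractmethod
    def search(self, query):
        ...

class SolrBackend(SearchBackend):
    def search(self, query):
        return f"solr:{query}"   # stands in for a real Solr request

class ZebraBackend(SearchBackend):
    def search(self, query):
        return f"zebra:{query}"  # stands in for a real ZOOM search

BACKENDS = {"solr": SolrBackend, "zebra": ZebraBackend}

def get_backend(name):
    """Choose a backend by configuration; callers never touch backend code."""
    return BACKENDS[name]()

print(get_backend("solr").search("title=koha"))  # solr:title=koha
```

The point of such a layer is that maintaining a Zebra backend for Z39.50/SRU service need not constrain the backend used for local OPAC searching.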
I had not given enough attention to Data::SearchEngine while giving as much attention as I have to investigating JZKit. Supporting the wider Perl community would be great. I am somewhat doubtful that Data::SearchEngine::Zebra or Data::SearchEngine::Pazpar2 would have much interest outside the Perl for libraries community, but that would still be much better than merely supporting the Koha community. I recognise that creating Data::SearchEngine::Zebra and Data::SearchEngine::Pazpar2 would require a rewrite rather than merely refactoring. Maintaining Zebra searching should not be necessary for maintaining Zebra as a Z39.50/SRU server.

7. USE CASES AND TESTING.
When the opportunity to rewrite the Zebra code arises, rewriting would be much easier with known working code to use as a starting point, however imperfect.

This is why we are willing to gather use cases.
Ian Ibbotson from Knowledge Integration informed me about local authority libraries in the UK using Z39.50 servers for conducting interlibrary loans between each other. Circulation issues are not my speciality, and I asked him for more information about how Z39.50 servers are used as part of the interlibrary loan process. I would be pleased to have an answer from anyone.
To build tests!!! No tests have ever really been built for C4::Search, so at the moment I cannot assess your fear of loss of features. Provide tests and we will do our best to make them pass. If it is proven that we cannot, then we will talk precisely.
Please excuse a moment of fun. I could devise some search query tests which I think a good record indexing and retrieval system ought to pass but which Koha would always fail. Failing such tests would certainly not be a regression, not only because there have never been tests for searching in Koha, but also because there have never been very demanding specifications for searching in Koha to which tests could have been written, apart from the great work which BibLibre have done to allow authority records to be found by authority tracing and reference fields containing unauthorised terms.

8. RELEVANCY RANKING OF COMPLETE FIELD OR SUBFIELD MATCHES.
By the way, did you know that the "valuable" feature you mentioned about searching "The" was not really a Zebra feature, as you stated, but was provided by the search code itself? It duplicated the old need for null words in Koha. But then again, there were no tests for that.
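Wherever it is implemented, the behaviour under discussion amounts to ranking a complete field match above partial matches, so that one-word titles survive stopword handling. A minimal sketch of that principle (hypothetical Python, not the Koha code):

```python
def rank(titles, query):
    """Rank complete-title matches first, so one-word titles such as
    "A" or "It" are not buried by stopword handling; a sketch of the
    principle under discussion, not any system's implementation."""
    q = query.lower()
    def score(title):
        t = title.lower()
        if t == q:
            return 2          # complete field match: top of the list
        if q in t.split():
            return 1          # word match somewhere in the title
        return 0              # no match
    # sorted() is stable, so equally scored titles keep their order.
    return sorted(titles, key=score, reverse=True)

hits = rank(["It Happened One Night", "It", "Stephen King's It"], "It")
print(hits[0])  # It
```

A system which drops 'it' as a stopword before scoring cannot make this distinction at all, which is the danger of losing the feature.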
This is another point at which Henri-Damien has apparently misunderstood me. I had not claimed that ranking a complete field or subfield match at the top of the result list, even for words which some systems treat as stopwords, is a Zebra feature, http://lists.koha-community.org/pipermail/koha-devel/2010-October/034477.htm... . I merely expressed my concern that the feature which Joshua Ferraro had written for searching might be an example of good code which could be lost as part of the simplification reducing the search code by 90% in BibLibre's rewriting of Koha search code. A title containing 'the' as the only word may not be a realistic case, but the principle would be the same: real world examples of titles which would not be indexed, or would have low relevancy rankings, in many systems include "A" by Andy Warhol and "It" by Stephen King.

I do not know that 90% of the Koha search code would be unnecessary without Zebra, but perhaps switching to Data::SearchEngine would allow just such a degree of simplification while maintaining features. Much code could be saved if Solr/Lucene manages what the Koha search code now needs to manage for local searching when using Zebra. Some code is almost certainly inefficient and overly complex without benefit. However, code simplicity is not a virtue above all others. Feature sophistication often requires irreducible code complexity to address irreducible complexities in the world. Sophisticated features tend to take more code to implement than less sophisticated ones. Code may need complexity to properly provide simplicity and ease of use for the user.

9. FURTHER EXAMINATION.
I have raised all of these issues on the koha-devel list without much specific response. On the koha-devel list, MJ Ray has also more recently asked for more details about Zebra problems in a form which could be independently examined by others and he should have more complete replies than he has had thus far, http://lists.koha-community.org/pipermail/koha-devel/2010-November/034664.ht...
Well, I wanted a meeting to be held, and we were working on testing and trying to assess all the raised problems in order to provide you with more technical details. Expect another message from us soon.
A meeting may be helpful but is unlikely to resolve much without further testing and detailed discussion. Some have suggested that Koha code is at fault for some reported problems with Zebra. I could use my own well tested Z39.50 client to test some issues independently from Koha, if there is a Zebra server with problematic records and configuration which I can access for that purpose.

[...]

Thomas Dukleth
Agogme
109 E 9th Street, 3D
New York, NY 10003
USA
http://www.agogme.com
+1 212-674-3783