[Koha] Proposal to form Koha Technical Committee

Thomas Dukleth kohalist at agogme.com
Mon Nov 22 19:56:25 NZDT 2010


Apparently, I had not been sufficiently clear in a couple of messages
relating to Solr/Lucene.

Being as clear as I can be briefly:

I favour using Solr/Lucene for local indexing in Koha.

I oppose losing support for Zebra as a Z39.50/SRU server without a
replacement with a sufficiently comparable feature set.  There is no other
free software Z39.50/SRU server with a sufficiently comparable feature set
at present.

I also oppose losing the Z39.50 client support in Koha and other useful
searching features which have been developed for Zebra.  If Solr/Lucene
were to replace Zebra, such a loss would affect future development but
not currently implemented features.

BibLibre's prospective rewrite of searching does not affect the current
Koha Z39.50 client for copy cataloguing, which does not rely upon the Koha
search module, C4::Search.  Only possible improvements to copy cataloguing
would be affected if some Koha search code were no longer available as a
model.

Anyone who has worked with Zebra knows that it can be like working with a
mysterious black box, at least when updating indexes.  Reports of failures
from Zebra with no error message (and, I presume, no success code on
success) need close investigation if we are relying on Zebra for any
purpose.

Good error reporting is vital for good software.  No software ever 'just
works'.  The presumption that software 'just works' comes from not
examining it closely enough, or from its errors falling within some
tolerance range.

After determining that Solr/Lucene is now suitable for Koha, I have given
my attention to what would need to be done to give some Z39.50/SRU server
option a feature set sufficiently comparable to Zebra's while supporting
Solr/Lucene.  The attention which I have given to the BibLibre suggestion
of JZKit as a replacement Z39.50/SRU server may have led me to neglect
other things, such as the full significance of BibLibre's use of the
Data::SearchEngine Perl module for adding Solr/Lucene support to Koha.


Remainder of reply inline:


On Sat, November 20, 2010 04:49, LAURENT Henri-Damien wrote:
> On 19/11/2010 22:41, Thomas Dukleth wrote:
>> Reply inline:
>>
> [snip]

1.  QUALITIES OF SOLR/LUCENE.

>> Much of the capabilities which Zebra support provides are not being used
>> in Koha and we are comparing our own familiarity with Zebra difficulties
>> with a rosy ideal of what Solr/Lucene offers.  The first advantage of
>> Solr/Lucene is that it empowers less sophisticated users to control how
>> indexing is configured via a web interface.  A web interface could be
>> created for Zebra but that is not an existing feature.  I am for
>> everything which empowers users to more easily exercise control over
>> their
>> software.
> As far as I know, Solr is successfully used every day in many open
> source OPACs, and in other projects too (Thunderbird, Alfresco, Drupal,
> ...).  I do not claim that they are better than we are.  But why should
> we doubt that this widely used solution, which is a real kind of
> standard among indexing engines, would be a good one?

I do not doubt that Solr/Lucene provides for good indexing and searching.
At the time when the projects which Henri-Damien Laurent lists started
using Solr/Lucene, the Solr/Lucene feature set was not sufficiently
sophisticated for Koha and was clearly less well developed than Zebra.
Lucene alone would have been sufficient by 2006, but not the simplified
subset of features supplied by Solr/Lucene at the time.  See my
koha-devel list message about the issue,
http://lists.koha-community.org/pipermail/koha-devel/2010-October/034468.html
.

> a) Getting and providing support for that tool would be easier.  A Solr
> community exists.

A large independent software community is certainly better than depending
on the small library market.  However, libraries have additional special
needs, such as Z39.50/SRU support, which the wider market should not be
expected to provide.


2.  OVERSIMPLIFICATION.

> b) Since it has been used in VuFind and Blacklight, I think we could
> share experiences more easily, and possibly build direct bridges between
> Koha and those solutions.

We may well be able to learn much from the experience which VuFind and
Blacklight have had with Solr/Lucene.  They are attractive OPACs which do
not have the burden of providing a proper library automation system. 
However, in ignoring library science principles, they have not served
their users as well as they might have otherwise.

The most significant advantage of VuFind and Blacklight is a primitive
implementation of faceted browsing of the result set.  Their model
copies the same primitive model provided by Endeca for NCSU,
http://www2.lib.ncsu.edu/catalog .  Facets are treated as mere text
strings, often without contextual meaning.  Subfields are treated in
isolation with no contextual association; most usually, only subfield $a
appears in facets.  Authority control is not used for authority
controlled fields.

In adding faceting from the result set to Koha, Joshua Ferraro followed
the fashionable interest of the moment in the NCSU Endeca OPAC, which was
also the easiest model to implement.  Endeca had previously done much
better work for non-library customers.

I tried to interest Joshua in using a model more like facet browsing
systems such as AquaBrowser, interfaces from OVID, and others.  I even
added my own more flexible design to the Koha wiki.  Joshua continued to
advocate the fashion for the NCSU Endeca OPAC and added a claim that
LibLime users would not want a more sophisticated and necessarily more
complex model.

The BibLibre Solr/Lucene implementation seems to match the existing Koha
implementation of facets, which is no detriment to the hard work of
BibLibre.  A more sophisticated faceting model can always be provided for
Koha in future.

When cooperating with other projects, we should be aware that other
projects have had limited goals.  I hope that Koha will have as much to
teach other projects as to learn from them in future.

>
>
>>

3.  Z39.50 CLIENT SUPPORT.

>> There is more than just the problem of losing very good Z39.50/SRU
>> server
>> support which would follow from BibLibre's announced implementation of
>> Solr/Lucene for local indexing.  Z39.50 client support which was
>> undertaken for local indexing with Zebra could enable future Z39.50/SRU
>> client development without Zebra.  In rewriting searching for their own
>> testing branch of Koha for Solr/Lucene implementation, BibLibre have
>> removed valuable Z39.50 client support which could be used in future
>> features such as querying Z39.50 servers in addition to Solr/Lucene in
>> the
>> OPAC and presenting a unified result set using Pazpar2.

Support for Pazpar2 may be the more significant part of the potential loss
in this case.  Koha has no UNIMARC-specific support for Pazpar2; it is
therefore understandable that people at BibLibre may under-appreciate the
benefit of Pazpar2.


3.1.  Z39.50 CLIENT COPY CATALOGUING SUPPORT.

>>  Z39.50 client
>> copy cataloguing support is independent from C4::Search and is thus safe
>> but could not be improved using code from C4::Search if that code is
>> gone.
> It was done in Koha 2.2 and, as far as I know, without Zebra this
> feature still exists in our testing box.
> It is just using a direct ZOOM search rather than C4::Search::SimpleSearch.
> So this valuable feature is still there.  And we take care not to break
> existing features.

I should have put the Z39.50 client copy cataloguing sentence in a
paragraph of its own.  Henri-Damien seems to have misunderstood.  My
sentence clearly states that the Z39.50 client copy cataloguing support is
safe.  My concern is about the potential loss of search code which could
be used to improve the copy cataloguing client in possible future
development.


4.  SIMPLESERVER.

> The same goes for the Z39.50 server, for which we wrote a wrapper using
> Net::Z3950::SimpleServer.  This would allow people to expose their
> collections as a Z39.50 server.

Net::Z3950::SimpleServer has been on my list as one option which would
need improvement to have features sufficiently comparable to Zebra as a
Z39.50 server.  'Simple' may be taken to mean that supporting complexity
is an exercise left to the programmer using the tool.  Implementations of
SimpleServer generally support only use attributes because of the
complexity of managing more.  Below, Henri-Damien attests to the
difficulty of parsing PQF reliably.  SimpleServer does not support SRU.
The documentation for SimpleServer is less complete than the Zebra
documentation.  We also have much greater working knowledge of Zebra.

I will enquire with Index Data about what options there might be for
adding SRU support to SimpleServer.


5.  PREFIX QUERY FORMAT.

>
>>
>> As I have stated previously, C4::Search ought to be rewritten in future
>> using prefix query format (PQF) as the native language for Z39.50.
> Well, rewriting the whole search with PQF would not be handy in many
> respects.  I have thought about that many times, and using PQF still
> appears to me not to be the solution.
> a) It is really a pain to maintain and analyse.
> b) Whenever you need some more features in your search, you have to add
> some more qualifiers, and therefore provide a robust parser.

I agree that analysing PQF connectors is tricky in comparison to CCL
connectors, which Yaz converts to PQF.  In my own work independent of
Koha, I overcame difficulties by having the user interface display the
PQF query which my code would generate.

I have code for writing PQF query sets which I started in 2005 using PHP
Yaz, before Net::Z3950::ZOOM was available for Perl.  My code supports
the complete, and I mean complete, Bib-1 attribute set.  I could port the
code to Perl with a little effort, although it would still need some work
for more extensibility of the user controlled term sets.

Queries built using PQF can be tested by building the same query in CCL
and sending it to Yaz for conversion to PQF as a comparison.
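The contrast between CCL's infix connectors and PQF's prefix notation can be sketched as follows.  This is an illustrative Python fragment, not Koha code; the Bib-1 use attributes shown (1=4 for title, 1=1003 for author) are from the standard attribute set.

```python
# Illustrative sketch (not Koha code): building a Z39.50 query in prefix
# query format (PQF).  Attribute type 1 is the Bib-1 "use" attribute:
# 1=4 means title, 1=1003 means author.

def pqf_term(use_attr: int, term: str) -> str:
    """Render one search term with its Bib-1 use attribute."""
    return f'@attr 1={use_attr} "{term}"'

def pqf_and(left: str, right: str) -> str:
    """PQF is prefix notation: the @and operator precedes both operands."""
    return f'@and {left} {right}'

# The CCL query  ti=dune and au=herbert  corresponds to:
query = pqf_and(pqf_term(4, "dune"), pqf_term(1003, "herbert"))
print(query)
# @and @attr 1=4 "dune" @attr 1=1003 "herbert"
```

Generating such strings is straightforward; the difficulty described above lies in the reverse direction, parsing arbitrary incoming PQF back into a structured query.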

Writing a PQF parser to interpret incoming PQF queries for SimpleServer
would be perhaps an order of magnitude more complex.

>> Continuing to support Common Command Language (CCL) is trivial because
>> Yaz translates CCL to PQF as it does now for Koha.


6.  CODE ABSTRACTION.

>> In the context of refactoring to support all options, future rewriting
>> of C4::Search for Zebra would be rewriting it in C4::Search::Zebra along
>> with C4::Search::Solr, and C4::Search would become a neutral abstraction.
> Again, we used Data::SearchEngine as a base for coding, and use the
> results from querying Data::SearchEngine.
> If anyone is willing to write a Data::SearchEngine::Zebra,
> Data::SearchEngine::Nutch, or Data::SearchEngine::Pazpar2, feel free to
> do so.  We will contribute not just to our little community, but to the
> wider Perl community.

I had not given enough attention to Data::SearchEngine while
concentrating on investigating JZKit.  Supporting the wider Perl
community would be great.

I am somewhat doubtful that Data::SearchEngine::Zebra or
Data::SearchEngine::Pazpar2 would have much interest outside the Perl for
libraries community, but even that would be much better than merely
supporting the Koha community.

I recognise that creating Data::SearchEngine::Zebra and
Data::SearchEngine::Pazpar2 would require a rewrite rather than merely
refactoring.  Maintaining Zebra searching should not be necessary for
maintaining Zebra as a Z39.50/SRU server.
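The shape of such a neutral abstraction, with Zebra and Solr as interchangeable backends behind one interface, can be sketched as follows.  This is an illustrative Python sketch, not Koha or Data::SearchEngine code; the class and method names are hypothetical.

```python
# Illustrative sketch (hypothetical names): a neutral search abstraction
# under which Zebra and Solr backends could sit interchangeably.

from abc import ABC, abstractmethod

class SearchBackend(ABC):
    """Neutral interface: callers never see backend-specific details."""
    @abstractmethod
    def search(self, query: str) -> list[str]:
        ...

class ZebraBackend(SearchBackend):
    def search(self, query: str) -> list[str]:
        # Real code would translate the query to PQF and call Zebra via ZOOM.
        return [f"zebra:{query}"]

class SolrBackend(SearchBackend):
    def search(self, query: str) -> list[str]:
        # Real code would build a Lucene query and call the Solr HTTP API.
        return [f"solr:{query}"]

def find(backend: SearchBackend, query: str) -> list[str]:
    """Application code depends only on the neutral interface."""
    return backend.search(query)
```

Under this design, dropping one backend's searching code need not affect callers, which is the point of keeping the abstraction neutral.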


7.  USE CASES AND TESTING.

>
>>  When
>> the opportunity to rewrite the Zebra code arises, rewriting would be
>> much easier with known working code to use as a starting point, however
>> imperfect.
> This is why we are willing to gather use cases.

Ian Ibbotson from Knowledge Integration informed me about local authority
libraries in the UK using Z39.50 servers for conducting interlibrary
loans between each other.  Circulation issues are not my speciality, and
I asked him for more information about how Z39.50 servers are used as
part of the interlibrary loan process.  I would be pleased to have an
answer from anyone.

>  To build tests!!!
> No tests have ever really been built for C4::Search.
> So at the moment, I cannot assess your fear of loss of features.  Provide
> tests and we will do our best to make them pass.  If it is proven that we
> cannot, then we will talk precisely.

Please excuse a moment of fun.  I could devise some search query tests
which I think a good record indexing and retrieval system ought to pass
but which Koha would always fail.

Failing some good tests which I could devise would certainly not be a
regression, and not only because there have never been tests for
searching in Koha.  There have also never been very demanding
specifications for searching in Koha against which tests could have been
written, apart from the great work which BibLibre have done to allow
authority records to be found by authority tracing and reference fields
containing unauthorised terms.
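As an illustration of what such a test might look like, here is a Python sketch; the `search_titles` function is a hypothetical stand-in for whatever search entry point a real test would call, with a naive reference implementation supplied so the test runs.

```python
# Illustrative sketch: a regression test for one searching requirement.
# `search_titles` is hypothetical, standing in for a real search entry
# point; here it is a naive case-insensitive substring match.

def search_titles(catalogue: list[str], query: str) -> list[str]:
    q = query.lower()
    return [t for t in catalogue if q in t.lower()]

def test_one_word_titles_are_findable():
    # Edge cases: titles consisting of a single short word which many
    # systems would discard as a stopword and so fail to index.
    catalogue = ["A", "It", "The Name of the Rose"]
    assert "A" in search_titles(catalogue, "a")
    assert "It" in search_titles(catalogue, "it")

test_one_word_titles_are_findable()
```

A suite of such tests, written against agreed requirements rather than against the current behaviour, would give both sides something concrete to discuss.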


8.  RELEVANCY RANKING OF COMPLETE FIELD OR SUBFIELD MATCHES.

> By the way, did you know that the "valuable" feature you mentioned about
> searching "The" was not really a Zebra feature, as you stated, but was
> provided by the search code itself?  It duplicated the old need for null
> words in Koha.  But then again, there were no tests for that.

This is another point at which Henri-Damien has apparently misunderstood me.

I had not claimed that ranking a complete field or subfield match at the
top of the result list, even for words which some systems treat as
stopwords, is a Zebra feature,
http://lists.koha-community.org/pipermail/koha-devel/2010-October/034477.html
.  I merely expressed my concern that the feature which Joshua Ferraro
had written for searching might be an example of good code which could be
lost in a simplification reducing the search code by 90% in BibLibre's
rewriting of Koha search code.  A title containing 'the' as its only word
may not be a realistic case, but the principle would be the same.  Real
world examples of titles which would not be indexed, or would have low
relevancy rankings, in many systems include "A" by Andy Warhol and "It"
by Stephen King.

I do not know that 90% of the Koha search code would be unnecessary
without Zebra, but perhaps the virtue of switching to Data::SearchEngine
would be that it necessitates just such a degree of simplification while
maintaining features.

Much code could be saved if Solr/Lucene manages what the Koha search code
now needs to manage for local searching when using Zebra.  Some code is
almost certainly inefficient and overly complex without benefit.

Code simplicity is not a virtue above all others.  Feature sophistication
often requires irreducible code complexity to address irreducible
complexities in the world.  Sophisticated features tend to take more code
to implement than less sophisticated ones.  Code may need complexity to
properly provide simplicity and ease of use for the user.


9.  FURTHER EXAMINATION.

>
>>
>> I have raised all of these issues on the koha-devel list without much
>> specific response.  On the koha-devel list, MJ Ray has also more
>> recently
>> asked for more details about Zebra problems in a form which could be
>> independently examined by others and he should have more complete
>> replies
>> than he has had thus far,
>> http://lists.koha-community.org/pipermail/koha-devel/2010-November/034664.html
> Well, I wanted a meeting to be held, and we were working on testing and
> trying to assess all the raised problems in order to provide you with
> more technical details.  Expect another message from us soon.

A meeting may be helpful but is unlikely to resolve much without further
testing and detailed discussion.  Some have suggested that Koha code is at
fault for some reported problems with Zebra.  I could use my own well
tested Z39.50 client to test some issues independently from Koha if there
is a Zebra server with problematic records and configuration which I can
access for that purpose.

[...]


Thomas Dukleth
Agogme
109 E 9th Street, 3D
New York, NY  10003
USA
http://www.agogme.com
+1 212-674-3783



