[Koha] OAI-PMH harvester

Mike D. black23 at gmail.com
Wed Nov 23 03:57:19 NZDT 2022


Hey,
I'm really glad to see the OAI-PMH harvester discussion going on for Koha. I
think that if we choose a good, well-supported external harvester, we can save
a lot of energy and resources for implementing the related functionality in
the system.
Harvesting the records is only part of the story, and the easy part. Since
harvesting produces a lot of records, most of the time we can't avoid
post-processing them and merging them with the records in the local database.
For example, you may need to update records from a source that holds millions
of records while the local database holds only hundreds of thousands, so only
a small slice of the source is relevant. If we design the processing workflow
badly, it will take unnecessarily long and waste valuable resources.
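To make that concrete, here is a minimal sketch of the "only a slice is
relevant" idea, assuming each harvested record carries some match key that
also exists in the local database. The HarvestedRecord type, the
select_relevant function and the "rec-N" identifiers are hypothetical names
for illustration; nothing here is existing Koha code.

```python
from typing import AbstractSet, Iterable, Iterator, NamedTuple


class HarvestedRecord(NamedTuple):
    # Hypothetical shape of a harvested record: a match key plus the payload.
    identifier: str  # assumed match key, e.g. an OAI identifier
    marcxml: str     # raw MARCXML as received from the repository


def select_relevant(
    harvested: Iterable[HarvestedRecord],
    local_ids: AbstractSet[str],
) -> Iterator[HarvestedRecord]:
    """Lazily yield only the harvested records that already exist locally.

    The stream is consumed one record at a time, so millions of source
    records never need to be held in memory; each set lookup is O(1).
    """
    for record in harvested:
        if record.identifier in local_ids:
            yield record


if __name__ == "__main__":
    # In practice local_ids would be loaded once from the local database.
    local_ids = {"rec-2", "rec-4"}
    harvested = (HarvestedRecord(f"rec-{i}", "<record/>") for i in range(1, 6))
    for record in select_relevant(harvested, local_ids):
        print(record.identifier)
```

Filtering before any parsing or merging means the expensive work is done only
for the records that matter, whatever harvester ends up feeding the stream.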
I'd like to invite everyone to stay in touch, to discuss and to share our
experiences. Let's move this area towards a successful outcome.

Take care.

Michal

On Tue, 22 Nov 2022 at 15:13, BOUIS Sonia <sonia.bouis at univ-lyon3.fr>
wrote:

> Hi,
> Thanks to David, Tomas, Michal and Michael for your replies.
>
> So we have decided to evaluate several external OAI-PMH clients that could
> be used by Koha and to choose one at the end of January.
> There is a lot to do after that. We discussed it, and background jobs
> and cronjobs seem to be appropriate. We thought that the settings in the
> Koha intranet should only define URLs, sets, or XSLT stylesheets (for
> example, to transform DC XML into MARCXML).
>
> We are only at the beginning of the process 😊
>
> Kind regards,
> Sonia
>
> ------------------------------
>
> Message: 2
> Date: Wed, 26 Oct 2022 10:37:49 +1100
> From: "David Cook" <dcook at prosentient.com.au>
> To: "'Tomas Cohen Arazi'" <tomascohen at gmail.com>, "'BOUIS Sonia'"
>         <sonia.bouis at univ-lyon3.fr>
> Cc: "'koha'" <koha at lists.katipo.co.nz>, "'koha-devel'"
>         <koha-devel at lists.koha-community.org>
> Subject: Re: [Koha-devel] [Koha] OAI-PMH harvester
> Message-ID: <07af01d8e8ca$dfbddef0$9f399cd0$@prosentient.com.au>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Sonia,
>
>
>
> I’m excited to hear that KohaLA would like to finance an OAI-PMH client in
> Koha! This functionality is always brewing in the back of my mind, since I
> first raised 10662 back in 2013.
>
>
>
> As Tomas says, I think that the background jobs are a key component for
> processing incoming OAI-PMH records.
>
>
>
> However, the ***missing component right now is the scheduling of the
> OAI-PMH harvesting tasks***, and I think this is where opinions get
> divided. Below, I’ll provide some history and opinions on Koha OAI-PMH.
>
>
>
> --
>
>
>
> With 10662, the sponsored goal was for Koha library staff to schedule
> OAI-PMH harvests through the Web UI. However, Fridolin from BibLibre raised
> a point with me at Kohacon18 about how letting library staff control the
> timing of harvesting tasks could be a problem for support vendors. If too
> many libraries using the same public IP address tried to harvest from the
> same OAI-PMH repository, they could be rate limited or blocked. There could
> also be server load concerns. So there probably needs to be a balance
> between user configuration and system configuration. If I recall correctly,
> this is how DSpace’s OAI-PMH harvester works. Users set up targets and can
> start/stop harvests, but things like frequency and concurrency are handled
> by the system configuration.
>
>
>
> Based on my experience working on OAI-PMH on and off for nearly 10 years
> and as a Koha support vendor, I think my preference would be for sysadmins
> to handle most of the OAI-PMH harvesting details.
>
>
>
> The sponsorship for 10662 had certain requirements that many other
> libraries might not have, which is what made me think that it might be
> better to have an external client that connects to Koha. I thought maybe I
> could get the ordinary requirements pushed into Koha, and then handle
> extraordinary requirements externally. However, an external harvester won’t
> perform as fast as an internal harvester. (The compromise would be to write
> the harvester in such a way that people could provide different OAI-PMH
> harvester Perl modules that all stage records using the same core Koha
> modules.)
>
>
>
> Even then… the scheduling would depend on a library’s needs. Back in 2013,
> I had a Koha OAI-PMH harvester which worked as a cronjob. It would run each
> night. However, some libraries want to run OAI-PMH harvests as frequently
> as every 3 seconds. A cronjob’s shortest interval is 60 seconds, so a
> cronjob wouldn’t work for that requirement.
>
>
>
> If a cronjob isn’t suitable, then I think you’d need a daemon created by a
> new command like “koha-oai --start <instance_name>”. It could read a
> configuration file and handle scheduling accordingly. With 10662, I used
> the POE module, because I knew it well and it has some timer tools for
> scheduling tasks. If I were to work on it again, I’d probably use
> Mojo::IOLoop instead these days, since Mojolicious is already part of Koha
> while POE is not. (That said, using modules like Mojo and POE is
> difficult, because they’re hard to test using automation. That was one
> of the stumbling blocks with 10662. While the 10662 harvester worked very
> well, it was difficult to unit test. In hindsight, I should’ve written it
> in a way that was easier to unit test, but it had a lot of event-driven
> code which made things more difficult.)
>
>
>
> Another option would be to create a generic daemon for task scheduling in
> general (e.g. “koha-schedule”). Koha could use this for many things, but
> it’s a project in itself.
>
>
>
> --
>
>
>
> Downloading OAI-PMH records and importing MARCXML into Koha is actually
> fairly straightforward. The difficulty is the scheduling and management
> of tasks (and unit testing).
>
>
>
> I don’t know the answer that will make everyone happy. There are lots of
> different ways of managing and scheduling the tasks. Based on my
> experience, I’d suggest targeting the simplest approach first, because
> complexity will make it less likely for the project to succeed.
>
>
>
> On that note, I’d be happy to test/QA any OAI-PMH harvester put forward.
> When I was writing OAI-PMH harvester patches, I found it really hard to get
> QA, so I’m happy to be that resource for someone else. I’ve spent a lot of
> time thinking about this topic, so happy to provide advice, warnings,
> emotional support 😉.
>
>
>
> David Cook
>
> Senior Software Engineer
>
> Prosentient Systems
>
> Suite 7.03
>
> 6a Glen St
>
> Milsons Point NSW 2061
>
> Australia
>
>
>
> Office: 02 9212 0899
>
> Online: 02 8005 0595
>
>
>
> From: Koha-devel <koha-devel-bounces at lists.koha-community.org> On Behalf
> Of Tomas Cohen Arazi
> Sent: Wednesday, 26 October 2022 3:46 AM
> To: BOUIS Sonia <sonia.bouis at univ-lyon3.fr>
> Cc: koha <koha at lists.katipo.co.nz>; koha-devel <
> koha-devel at lists.koha-community.org>
> Subject: Re: [Koha-devel] [Koha] OAI-PMH harvester
>
>
>
> I think with background jobs we have most of the framework that is needed
> to deal with this within Koha.
>
>
>
> Best regards
>
>
>
> On Tue, 25 Oct 2022 at 7:08, BOUIS Sonia <sonia.bouis at univ-lyon3.fr <mailto:
> sonia.bouis at univ-lyon3.fr> > wrote:
>
> Hi,
> KohaLA would like to finance an OAI-PMH client in Koha, but we have
> questions that we want to raise with the community.
> There have already been attempts to propose an OAI-PMH client:
> - https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662 : it's
> an old project that doesn't seem compatible with the current version of Koha
> - https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=25905 : the
> scope is rather to use an external OAI-PMH client and connect it to Koha
>
> Our main question is about how to handle this. Do you think it's
> a better idea to use external software or a Perl routine and find a way
> to connect it to Koha? Or would it be better to write a new module in Koha
> from scratch, so that Koha has its own OAI-PMH client?
>
> Please let us hear your thoughts about this project.
>
> Kind regards
>
> Sonia
>
> Sonia BOUIS
> ------------------------------------------------------
> Head of the Library IT Service, Département d'Appui à la
> Recherche et aux Projets (DARP), University Libraries, Université
> Jean Moulin Lyon 3. Street address > Manufacture des Tabacs | 6 cours
> Albert Thomas | LYON 8e. Postal address > Bibliothèque de la Manufacture |
> 1C avenue des Frères Lumière | CS 78242 - 69372 LYON CEDEX 08
>
> Direct line: 33 (0)4 78 78 79 03
>
> http://bu.univ-lyon3.fr<http://bu.univ-lyon3.fr/>| Follow us > Facebook<
> https://www.facebook.com/bulyon3/> | Twitter<https://twitter.com/bulyon3>|
> Instagram<https://www.instagram.com/bu.lyon3/?hl=fr>
>
> _______________________________________________
>
> Koha mailing list  http://koha-community.org Koha at lists.katipo.co.nz
> <mailto:Koha at lists.katipo.co.nz>
> Unsubscribe: https://lists.katipo.co.nz/mailman/listinfo/koha
>

