[Koha] OAI-PMH harvester
BOUIS Sonia
sonia.bouis at univ-lyon3.fr
Tue Jan 24 05:47:03 NZDT 2023
Hi,
Just to let you know that during the KohaLa hackathon (which runs until Wednesday), we are thinking about the OAI-PMH harvester. I have added our first thoughts to the Bugzilla ticket: https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662#c318
Kind regards,
Sonia
-----Original Message-----
From: Arthur Suzuki [mailto:arthur.suzuki at biblibre.com]
Sent: Wednesday, 23 November 2022 14:44
To: Mike D. <black23 at gmail.com>; BOUIS Sonia <sonia.bouis at univ-lyon3.fr>
Cc: koha at lists.katipo.co.nz; koha-devel at lists.koha-community.org
Subject: Re: [Koha] OAI-PMH harvester
Hello there,
If I may suggest a good harvester library, Catmandu may do the job pretty well.
I haven't used its OAI module, but I have used Catmandu to harvest from a JSON source and transform the data into a UNIMARC file, with pretty good success so far.
It can export seamlessly to ISO 2709 or MARCXML.
https://metacpan.org/dist/Catmandu-OAI
Best,
Arthur
On 2022-11-22 15:57, Mike D. wrote:
> Hey,
> I'm really glad to see the OAI-PMH harvester debate going on for Koha.
> I think that if we choose a good external harvester with support, we
> can save a lot of energy and resources that can go into implementing
> the related activities in the system.
> Harvesting the records is only part of the story, and the easy part.
> Since a harvest produces a lot of records, most of the time we can't
> avoid post-processing and merging them with the records in the local
> database. For example, you may need to update records from a source
> that holds millions of records while the local database holds only
> hundreds of thousands, so only a slice of that huge amount is
> relevant. If we design the processing workflow badly, it will take
> unnecessarily long and burn valuable resources.
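To make the "only a slice is relevant" point concrete, here is a minimal Python sketch. It assumes harvested and local records can be matched on a shared identifier; the `identifier` key and the function name are hypothetical and are not Koha's data model. The idea is to load the local identifiers into a set once and stream the harvest through it, so the work grows with the size of the harvest rather than with harvest size times database size.

```python
from typing import Dict, Iterable, Iterator, Set


def relevant_records(
    harvested: Iterable[Dict[str, str]],
    local_ids: Set[str],
) -> Iterator[Dict[str, str]]:
    """Yield only harvested records whose identifier exists locally.

    `harvested` may be a generator over millions of records; it is consumed
    lazily, and each membership test against the set is O(1) on average.
    Records without an "identifier" key are skipped.
    """
    for record in harvested:
        if record.get("identifier") in local_ids:
            yield record
```

Because the function is a generator, the filtered records can be handed straight to whatever merge step follows without holding the whole harvest in memory.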
> I would like to invite us all to stay in touch, to debate and to share
> our experiences. Let's get this area moving towards a successful
> finish.
>
> Take care.
>
> Michal
>
> On Tue, 22 Nov 2022 at 15:13, BOUIS Sonia
> <sonia.bouis at univ-lyon3.fr>
> wrote:
>
>> Hi,
>> Thanks to David, Tomas, Michal and Michael for your replies.
>>
>> So we have decided to evaluate several external OAI-PMH clients that
>> could be used by Koha and to choose one at the end of January. There
>> is a lot to do after that. We discussed background jobs and cronjobs,
>> which seem to be appropriate. We thought that the settings in the
>> Koha intranet should only define URLs, sets, or XSLT stylesheets
>> (for example, to transform DC XML into MARCXML).
>>
>> We are only at the beginning of the process 😊
>>
>> Kind regards,
>> Sonia
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Wed, 26 Oct 2022 10:37:49 +1100
>> From: "David Cook" <dcook at prosentient.com.au>
>> To: "'Tomas Cohen Arazi'" <tomascohen at gmail.com>, "'BOUIS Sonia'"
>> <sonia.bouis at univ-lyon3.fr>
>> Cc: "'koha'" <koha at lists.katipo.co.nz>, "'koha-devel'"
>> <koha-devel at lists.koha-community.org>
>> Subject: Re: [Koha-devel] [Koha] OAI-PMH harvester
>> Message-ID: <07af01d8e8ca$dfbddef0$9f399cd0$@prosentient.com.au>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi Sonia,
>>
>>
>>
>> I’m excited to hear that KohaLA would like to finance an OAI-PMH
>> client in Koha! This functionality has been brewing in the back of
>> my mind ever since I first raised bug 10662 back in 2013.
>>
>>
>>
>> As Tomas says, I think that the background jobs are a key component
>> for processing incoming OAI-PMH records.
>>
>>
>>
>> However, the ***missing component right now is the scheduling of the
>> OAI-PMH harvesting tasks***, and I think this is where opinions get
>> divided. Below, I’ll provide some history and opinions on Koha
>> OAI-PMH.
>>
>>
>>
>> --
>>
>>
>>
>> With 10662, the sponsored goal was for Koha library staff to schedule
>> OAI-PMH harvests through the Web UI. However, Fridolin from BibLibre
>> raised a point with me at Kohacon18 about how letting library staff
>> control the timing of harvesting tasks could be a problem for support
>> vendors. If too many libraries using the same public IP address tried
>> to harvest from the same OAI-PMH repository, they could be rate
>> limited or blocked. There could also be server load concerns. So
>> there probably needs to be a balance between user configuration and
>> system configuration. If I recall correctly, this is how DSpace’s
>> OAI-PMH harvester works. Users set up targets and can start/stop
>> harvests, but things like frequency and concurrency are handled by
>> the system configuration.
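The split between user configuration and system configuration described above could look roughly like the following Python sketch. It is an illustration only, not how DSpace or bug 10662 implement it; the class names, field names and default values are all hypothetical. Staff choose targets and a requested frequency, while the sysadmin's policy caps how often and how many harvests actually run.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SystemPolicy:
    """Limits set by the sysadmin (hypothetical names and defaults)."""
    min_interval_seconds: int = 3600
    max_concurrent_harvests: int = 2


@dataclass(frozen=True)
class HarvestTarget:
    """What library staff would configure through the web UI."""
    base_url: str
    requested_interval_seconds: int
    enabled: bool = True


def effective_interval(target: HarvestTarget, policy: SystemPolicy) -> int:
    """Never harvest more often than the system policy allows."""
    return max(target.requested_interval_seconds, policy.min_interval_seconds)


def targets_to_run(
    targets: Sequence[HarvestTarget],
    running_count: int,
    policy: SystemPolicy,
) -> List[HarvestTarget]:
    """Pick enabled targets without exceeding the concurrency limit."""
    free_slots = max(policy.max_concurrent_harvests - running_count, 0)
    enabled = [t for t in targets if t.enabled]
    return enabled[:free_slots]
```

With a policy like this, a target that asks for a 3-second interval is simply slowed down to the system minimum, which addresses the shared-IP rate-limiting concern without taking start/stop control away from staff.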
>>
>>
>>
>> Based on my experience working on OAI-PMH on and off for nearly 10
>> years and as a Koha support vendor, I think my preference would be
>> for sysadmins to handle most of the OAI-PMH harvesting details.
>>
>>
>>
>> The sponsorship for 10662 had certain requirements that many other
>> libraries might not have, which is what made me think that it might
>> be better to have an external client that connects to Koha. I thought
>> maybe I could get the ordinary requirements pushed into Koha, and
>> then handle extraordinary requirements externally. However, an
>> external harvester won’t perform as fast as an internal harvester.
>> (The compromise would be to write the harvester in such a way that
>> people could provide different OAI-PMH harvester Perl modules that
>> all stage records using the same core Koha
>> modules.)
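The compromise in the parenthesis (interchangeable harvester modules that all stage records through the same core code) might be sketched like this. The sketch is in Python rather than Perl, and every name in it (`Harvester`, `StaticHarvester`, `stage_all`) is hypothetical; the list used as a staging area is a stand-in for whatever the core Koha import code would actually do.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class Harvester(ABC):
    """Interface every harvester plugin would implement (hypothetical)."""

    @abstractmethod
    def records(self) -> Iterator[str]:
        """Yield harvested records as MARCXML strings."""


class StaticHarvester(Harvester):
    """Toy plugin that 'harvests' from an in-memory list."""

    def __init__(self, marcxml_records: List[str]) -> None:
        self._records = marcxml_records

    def records(self) -> Iterator[str]:
        yield from self._records


def stage_all(harvester: Harvester, staging_area: List[str]) -> int:
    """Shared staging step: the same code path for every plugin.

    Returns the number of records staged.
    """
    count = 0
    for marcxml in harvester.records():
        staging_area.append(marcxml)
        count += 1
    return count
```

The point of the design is that ordinary and extraordinary requirements differ only in the plugin, while staging, matching and import stay in one tested place.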
>>
>>
>>
>> Even then… the scheduling would depend on a library’s needs. Back in
>> 2013, I had a Koha OAI-PMH harvester which worked as a cronjob. It
>> would run each night. However, some libraries want to run OAI-PMH
>> harvests as frequently as every 3 seconds. A cronjob’s shortest
>> interval is 60 seconds, so it wouldn’t work for that requirement.
>>
>>
>>
>> If a cronjob isn’t suitable, then I think you’d need a daemon started
>> by a new command like “koha-oai --start <instance_name>”. It could
>> read a configuration file and handle scheduling accordingly. With
>> 10662, I used the POE module, because I knew it well and it has some
>> timer tools for scheduling tasks. If I were to work on it again, I’d
>> probably use Mojo::IOLoop instead these days, since Mojolicious is
>> already part of Koha while POE is not. (That said, using modules like
>> Mojo and POE is tricky, because they’re hard to cover with automated
>> tests. That was one of the stumbling blocks with 10662. While
>> the 10662 harvester worked very well, it was difficult to unit test.
>> In hindsight, I should’ve written it in a way that was easier to unit
>> test, but it had a lot of event-driven code which made things more
>> difficult.)
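As a rough illustration of the daemon idea and of the testability concern, here is a small sketch of a timer loop. It uses Python's `asyncio` as a stand-in for an event loop such as Mojo::IOLoop or POE and is not the 10662 code; `run_periodically` and its `max_runs` parameter are hypothetical. Passing the task in as an argument and allowing a bounded number of runs is one way to keep event-driven scheduling unit-testable.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_periodically(
    task: Callable[[], Awaitable[None]],
    interval_seconds: float,
    max_runs: Optional[int] = None,
) -> int:
    """Run `task` repeatedly and return how many times it ran.

    The interval is measured from the end of one run to the start of the
    next. `max_runs` exists so tests can stop the loop; a long-running
    daemon would pass None and loop until cancelled.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        await task()
        runs += 1
        if max_runs is None or runs < max_runs:
            await asyncio.sleep(interval_seconds)
    return runs
```

A daemon entry point could call `asyncio.run(run_periodically(harvest, interval_seconds=3))` with its own `harvest` coroutine, giving the sub-minute interval that a cronjob cannot express, while a unit test passes a fake task and a small `max_runs`.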
>>
>>
>>
>> Another option would be to create a generic daemon for task
>> scheduling in general (e.g. “koha-schedule”). Koha could use this for
>> many things, but it’s a project in itself.
>>
>>
>>
>> --
>>
>>
>>
>> Downloading OAI-PMH records and importing MARCXML into Koha is
>> actually fairly straightforward. The difficulty is the scheduling
>> and management of tasks (and unit testing).
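For readers who have not seen the download side, this is a minimal Python sketch of an OAI-PMH `ListRecords` loop that follows resumption tokens. It is not taken from any existing Koha patch; the function name and the injected `fetch` callable are assumptions made so that the paging logic can be tested without a network. The value of `metadata_prefix` depends on what the repository offers.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records(
    fetch: Callable[[Dict[str, str]], str],
    metadata_prefix: str,
) -> Iterator[ET.Element]:
    """Yield every <record> element, following resumption tokens.

    `fetch` takes the query parameters of one request and returns the
    response body as XML text; how it performs HTTP is up to the caller.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        yield from root.iterfind("oai:ListRecords/oai:record", OAI_NS)
        token = root.find("oai:ListRecords/oai:resumptionToken", OAI_NS)
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one, marks the last page
        # A resumption token is an exclusive argument: send it with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

A real `fetch` would URL-encode the parameters and issue a GET against the repository's base URL; error responses, retries and deleted records are deliberately left out of this sketch.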
>>
>>
>>
>> I don’t know the answer that will make everyone happy. There are lots
>> of different ways of managing and scheduling the tasks. Based on my
>> experience, I’d suggest targeting the simplest approach first,
>> because complexity will make it less likely for the project to succeed.
>>
>>
>>
>> On that note, I’d be happy to test/QA any OAI-PMH harvester put
>> forward.
>> When I was writing OAI-PMH harvester patches, I found it really hard
>> to get QA, so I’m happy to be that resource for someone else. I’ve
>> spent a lot of time thinking about this topic, so happy to provide
>> advice, warnings, emotional support 😉.
>>
>>
>>
>> David Cook
>> Senior Software Engineer
>> Prosentient Systems
>> Suite 7.03, 6a Glen St
>> Milsons Point NSW 2061
>> Australia
>>
>> Office: 02 9212 0899
>> Online: 02 8005 0595
>>
>>
>>
>> From: Koha-devel <koha-devel-bounces at lists.koha-community.org> On
>> Behalf Of Tomas Cohen Arazi
>> Sent: Wednesday, 26 October 2022 3:46 AM
>> To: BOUIS Sonia <sonia.bouis at univ-lyon3.fr>
>> Cc: koha <koha at lists.katipo.co.nz>; koha-devel
>> <koha-devel at lists.koha-community.org>
>> Subject: Re: [Koha-devel] [Koha] OAI-PMH harvester
>>
>>
>>
>> I think with background jobs we have most of the framework that is
>> needed to deal with this within Koha.
>>
>>
>>
>> Best regards
>>
>>
>>
>> On Tue, 25 Oct 2022 at 7:08, BOUIS Sonia <sonia.bouis at univ-lyon3.fr>
>> wrote:
>>
>> Hi,
>> KohaLA would like to finance an OAI-PMH client in Koha, but we have
>> questions that we want to raise with the community.
>> There have already been attempts to propose an OAI-PMH client:
>> - https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662 :
>> it's an old project that doesn't seem compatible with the current
>> version of Koha
>> - https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=25905 :
>> the scope is rather to use an external OAI-PMH client and to connect
>> it to Koha
>>
>> Our main question is about the way to handle this. Do you think that
>> it's a better idea to use external software or a Perl routine and to
>> find a way to connect it to Koha? Or would it be better to write a
>> new module in Koha from scratch, so that Koha has its own OAI-PMH
>> client?
>>
>> Please let us hear your thoughts about this project.
>>
>> Kind regards
>>
>> Sonia
>>
>> Sonia BOUIS
>> ------------------------------------------------------
>> Head of the Library IT Service
>> Research and Projects Support Department (DARP)
>> University Libraries, Université Jean Moulin Lyon 3
>> Street address: Manufacture des Tabacs | 6 cours Albert Thomas | LYON 8e
>> Postal address: Bibliothèque de la Manufacture | 1C avenue des Frères
>> Lumière | CS 78242 - 69372 LYON CEDEX 08
>>
>> Direct line: 33 (0)4 78 78 79 03
>>
>> http://bu.univ-lyon3.fr | Follow us:
>> Facebook <https://www.facebook.com/bulyon3/> |
>> Twitter <https://twitter.com/bulyon3> |
>> Instagram <https://www.instagram.com/bu.lyon3/?hl=fr>
>>
>> _______________________________________________
>>
>> Koha mailing list http://koha-community.org Koha at lists.katipo.co.nz
>> <mailto:Koha at lists.katipo.co.nz>
>> Unsubscribe: https://lists.katipo.co.nz/mailman/listinfo/koha
>>
>>
--
Arthur Suzuki, 🌈🏔️
Developer @BibLibre