[Koha] Block web crawlers Bots

Fri Jan 28 22:50:39 NZDT 2022

Hi,

robots.txt is only a request to bots not to crawl your web server. It is not a technical block, so bots are free to ignore it, and those that do not comply with the rules and standards will.
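
To illustrate why this is only advisory: the check happens on the bot's side. Here is a minimal sketch in Python of what a well-behaved crawler does with the rules quoted below, using the standard library's urllib.robotparser. The helper name is_allowed and the OPAC URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the original question: ask every bot to stay away.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant bot would consider itself allowed to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical OPAC URL, used only as an example.
    url = "http://opac.example.org/cgi-bin/koha/opac-search.pl"
    print(is_allowed("Googlebot", url))
```

A bot that never runs such a check simply sends its requests, and the web server answers them as usual.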

You would need a technical solution to exclude these IPs, e.g. via your institution's firewall. You could also configure Apache to exclude them. However, a fixed block list needs constant maintenance to stay up to date.
A better way would be to exclude IPs by their behaviour, e.g. if they contact your web server more than n times within a given time window.
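
The "more than n times within a given time window" idea can be sketched as a sliding-window counter per IP. This is only an illustration of the logic in Python, not something to plug into Koha or Apache as it stands; the class name, the limit of 3 and the 10-second window are arbitrary assumptions.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-IP queue of timestamps of recently accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: this request would be rejected
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    # Example IP from the documentation range; timestamps are in seconds.
    for t in (0.0, 1.0, 2.0, 3.0):
        print(t, limiter.allow("203.0.113.5", now=t))
```

In practice the web server or firewall would do this counting; the sketch only shows what such a rule decides for each request.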

You would have to look for "rate limiting" solutions. There seem to be Apache modules available for this, which might be of use. In another (non-Koha) context, mod_qos was recommended. However, I don't know which solution works best with Koha. Unfortunately I am not very familiar with this area, so please discuss it with a web server specialist.

Hope this is of use anyway,

Regards, Anke

-- 
Anke Bruns M.A. (LIS)
Working group "Anwendungs- und Informationssysteme" (Application and Information Systems)
E-Mail: anke.bruns at gwdg.de
---------------------------------------
Attention! New contact details!

Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Burckhardtweg 4, 37077 Göttingen, URL: https://gwdg.de

Support: Tel.: +49 551 39-30000, URL: https://gwdg.de/support
Secretariat: Tel.: +49 551 39-30001, E-Mail: gwdg at gwdg.de

Managing Director: Prof. Dr. Ramin Yahyapour
Chairman of the Supervisory Board: Prof. Dr. Norbert Lossau
Registered office: Göttingen
Court of registration: Göttingen, commercial register no. B 598
---------------------------------------
Certified according to ISO 9001
---------------------------------------

> -----Original Message-----
> From: Koha <koha-bounces at lists.katipo.co.nz> On Behalf Of JITHIN N
> Sent: Friday, 28 January 2022 06:05
> To: koha at lists.katipo.co.nz
> Subject: [Koha] Block web crawlers Bots
> 
> My server's memory usage is increasing because web search engine bots are
> accessing the OPAC. I tried putting a robots.txt file in
> 
> /usr/share/koha/opac/htdocs
> 
> with the content
> 
> User-agent: *
> Disallow: /
> 
> But my server is still being loaded by bots. The Apache log file shows
> bot names like petalbot, Googlebot, AhrefsBot etc. How can I block all
> these bots' access to my OPAC pages?
> 
> With Regards
> 
> Jithin N
> 
> *System Information*
> 
> Koha 21.05
> 
> Debian 10
> _______________________________________________
> 
> Koha mailing list  http://koha-community.org
> Koha at lists.katipo.co.nz
> Unsubscribe: https://lists.katipo.co.nz/mailman/listinfo/koha