[Koha] Problems with the facebook web crawler

Hector Gonzalez Jaime cacho at genac.org
Fri Jul 26 14:39:44 NZST 2024


You might skip mod_sec and do the detection with fail2ban's
apache-badbots filter, by changing its failregex to the following (the
spaces ARE important, so copy and paste it; it is a single line, even if
your mail client wraps it):

failregex = ^(?:\S+:\d+ )?<ADDR> [^"]*"[A-Z]+ [^"]+" \d+ \d+ "[^"]*" "[^"]*(?:<badbots>|<badbotscustom>)[^"]*"

adding the bad bots to the start of the "badbots" regex like:

badbots = meta-externalagent|facebookexternalhit|SemrushBot|amazonbot|AmazonBot|ClaudeBot|claudebot|Atomic_Email_Hunter/4\.0|
... rest of the regex stays here.

and adding jails like these, one for the Apache access log and one for
the Koha plack log:

[apache-badbots]
enabled = true
port     = http,https
filter   = apache-badbots
bantime  = 48h
logpath  = %(apache_access_log)s
maxretry = 1

[apache-badbots2]
enabled = true
port     = http,https
filter   = apache-badbots
bantime  = 48h
logpath  = /var/log/koha/USEYOURKOHASITENAMEHERE/plack.log
maxretry = 1
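
Before enabling the jails, it can help to sanity-check the modified
failregex against a few log lines. The following is only a rough Python
sketch of that check, not what fail2ban runs internally: the ADDR
pattern is a simplified stand-in for fail2ban's <ADDR> tag, BADBOTS
holds only the bots named above (the stock list is left out), and the
sample log lines and the banned_address helper are made up for
illustration.

```python
import re

# Simplified stand-in for fail2ban's <ADDR> tag (assumption: the real
# expansion is stricter about what counts as an IPv4/IPv6 address).
ADDR = r"(?P<addr>[0-9A-Fa-f:.]+)"

# Only the bots named in this mail; the stock badbots list is omitted.
BADBOTS = (
    "meta-externalagent|facebookexternalhit|SemrushBot|amazonbot|AmazonBot"
    "|ClaudeBot|claudebot"
)

# Same structure as the failregex above, with the tags substituted.
FAILREGEX = re.compile(
    r'^(?:\S+:\d+ )?' + ADDR
    + r' [^"]*"[A-Z]+ [^"]+" \d+ \d+ "[^"]*" "[^"]*(?:' + BADBOTS + r')[^"]*"'
)


def banned_address(line):
    """Return the client address if the line matches the regex, else None."""
    match = FAILREGEX.search(line)
    return match.group("addr") if match else None


# Made-up sample lines in Apache "combined" format (hypothetical hosts).
SAMPLES = [
    '2001:db8::1 - - [26/Jul/2024:14:39:44 +1200] '
    '"GET /cgi-bin/koha/opac-search.pl?q=a HTTP/1.1" 200 5120 "-" '
    '"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"',
    'opac.example.org:443 203.0.113.7 - - [26/Jul/2024:14:40:02 +1200] '
    '"GET /cgi-bin/koha/opac-search.pl?q=b HTTP/1.1" 200 4096 "-" '
    '"Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"',
    '198.51.100.20 - - [26/Jul/2024:14:41:10 +1200] '
    '"GET /cgi-bin/koha/opac-main.pl HTTP/1.1" 200 20480 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"',
]

for sample in SAMPLES:
    print(banned_address(sample))
```

With maxretry = 1, a single matching line is enough to ban the address.
For the real filter, fail2ban ships a fail2ban-regex tool that tests a
filter file against an actual log, which is a more faithful check than
this sketch.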

On 7/25/24 10:15, Indranil Das Gupta wrote:
> Hi Nigel,
>
> My solution for that is a simple two-step process:
>
> 1) using mod_sec to monitor and match the UA string of the incoming request
> against a list of UAs I don't want and return an HTTP 406 if the UA matches
> for the first time.
>
> 2) Have fail2ban monitor the apache log for 406 and immediately ban the IP
> (IPv4 / IPv6) for 96 hours using an apache-badbots jail.
>
> This strategy has so far managed to keep my servers "cool".
>
> cheers
> -idg
>
>
> On Thu, Jul 25, 2024, 16:57 Nigel Titley <nigel at titley.com> wrote:
>
>> Is anyone else getting problems with the facebook web crawler hammering
>> their OPAC search function?
>>
>> This has been happening on and off for a couple of months but set in
>> with a vengeance a couple of days ago. The crawler is hitting us with
>> many OPAC search queries, beyond the capacity of our system to respond.
>>
>> robots.txt is being ignored
>>
>> I started by blocking facebook's entire IPv6 range as the queries were
>> all coming in over IPv6. They responded by switching to IPv4 and because
>> they have a number of blocks it wasn't practical to block each and every
>> one of them.
>>
>> I've temporarily switched off OPAC entirely and the system has returned
>> to normal and I can at least perform intranet functions but this is
>> obviously non-ideal.
>>
>> Does anyone have any thoughts on this?
>>
>> I'm running 22.05.13.000 on Ubuntu.
>>
>> Thanks
>>
>> Nigel
>> _______________________________________________
>>
>> Koha mailing list  http://koha-community.org
>> Koha at lists.katipo.co.nz
>> Unsubscribe: https://lists.katipo.co.nz/mailman/listinfo/koha
>>

-- 
Hector Gonzalez
cacho at genac.org

