[Koha] Koha Digest, Vol 219, Issue 7 - how to avoid high cpu uses due to web crawlers (vinod mishra)

Amar Londhe amar at ourlib.in
Mon Jan 15 20:16:22 NZDT 2024


Hello,
I've observed a concerning issue with our Koha server, where multiple 
bots are causing downtime and significantly increasing CPU usage. Some 
of the problematic bots include:

  * PetalBot;+https://webmaster.petalsearch.com/site/petalbot
  * MJ12bot/v1.4.8; http://mj12bot.com/
  * SemrushBot/7~bl; +http://www.semrush.com/bot.html

I tried to address this by adding a robots.txt file, but it has not 
stopped these bots from causing disruptions. Additionally, because 
their IP addresses keep changing, blocking them individually is 
impractical.
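
For illustration, a minimal sketch of refusing these three crawlers by 
User-Agent rather than by IP address. It assumes mod_rewrite is enabled 
and that the lines go into the OPAC virtual host; both are assumptions 
about the setup, and a bot that fakes its User-Agent will slip through:

```
# Sketch only: answer 403 Forbidden to any request whose User-Agent
# contains one of the bot names listed above (case-insensitive).
RewriteEngine On
RewriteCond "%{HTTP_USER_AGENT}" "PetalBot|MJ12bot|SemrushBot" [NC]
RewriteRule "^" "-" [F]
```

Running `apache2ctl configtest` before reloading Apache should catch 
syntax errors in a change like this.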

Furthermore, I've noticed that the Apache2 server is generating internal 
requests, and I'm uncertain about the cause and purpose of these requests.
`::1 - - [15/Jan/2024:12:40:41 +0530] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.41 (Ubuntu) OpenSSL/1.1.1f (internal dummy connection)"`

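As far as the Apache documentation describes it, requests marked 
`(internal dummy connection)` are how the parent process wakes up its 
idle child processes, so they come from the server itself rather than 
from an outside client. If they only clutter the log, a sketch like the 
following keeps loopback requests out of it; the log path and the 
`combined` format name are assumptions based on Ubuntu's default layout:

```
# Sketch only: tag requests that come from the loopback addresses ...
SetEnvIf Remote_Addr "^(127\.0\.0\.1|::1)$" loopback
# ... and leave tagged requests out of the access log.
CustomLog ${APACHE_LOG_DIR}/access.log combined env=!loopback
```

Note that this also hides genuine requests made from the server itself, 
for example by a reverse proxy running on the same host.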

I need your expertise to stop the bot traffic that is degrading server 
performance and driving up CPU usage, and to prevent the unauthorized 
internal requests.

Thanks and Regards,
Amar Londhe
Full-Stack Developer

On 11/01/24 4:30 am, koha-request at lists.katipo.co.nz wrote:
>
> Today's Topics:
>
>     1. how to avoid high cpu uses due to web crawlers (vinod mishra)
>     2. Re: how to avoid high cpu uses due to web crawlers
>        (Nirmit Krishnatray)
>     3. Re: how to avoid high cpu uses due to web crawlers (vinod mishra)
>     4. Re: how to avoid high cpu uses due to web crawlers
>        (Wagner, Alexander)
>     5. koha-US Board Meeting Minutes for January 10, 2024
>        (Kristi Krueger)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 10 Jan 2024 12:51:38 +0530
> From: vinod mishra<mishravk79 at gmail.com>
> To: Koha<Koha at lists.katipo.co.nz>
> Subject: [Koha] how to avoid high cpu uses due to web crawlers
> Message-ID:
> 	<CAGLUwiRDAsH66xeQoznjHXxEiGiiGvdTv8P7w5uihM3H93mU2g at mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> Hello
>
> I found that the IP 47.76.35.19 is hitting my OPAC continuously. As a
> result, CPU usage is very high, which makes the entire Koha OPAC and
> staff client very slow.
>
> I tried following the links but could not resolve the issue.
>
> https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=4042#c3
> https://wiki.koha-community.org/wiki/Koha_Tuning_Guide
>
> I am also not able to locate the .htaccess file on my Ubuntu 18.04
> server with Koha 20.04.
> Can anyone tell me how to resolve this?
>
> With Regards,
>
> Vinod Kumar Mishra,
> (Ph.D, MLISC, MA, B.Sc, DCA)
> Assistant Librarian,
> Biju Patnaik Central Library (BPCL),
> NIT Rourkela,
> Sundergadh-769008,
> Odisha,
> India.
> Mob:91+9439420860
> URL:https://vinod.itshelp.co.in/  <http://vinod.itshelp.co.in/>
> ORCID ID:https://orcid.org/0000-0003-4666-7874
> <http://orcid.org/0000-0003-4666-7874>
> Scopus ID: 57223138343
>
> *"Spiritual relationship is far more precious than physical. Physical
> relationship divorced from spiritual is body without soul" -- Mahatma
> Gandhi*
>
>
> ------------------------------
>
> Message: 2
> Date: Wed, 10 Jan 2024 07:46:17 +0000
> From: Nirmit Krishnatray<nirmit at edutech.com>
> To: vinod mishra<mishravk79 at gmail.com>, Koha
> 	<Koha at lists.katipo.co.nz>
> Subject: Re: [Koha] how to avoid high cpu uses due to web crawlers
> Message-ID:<27ebb0c12a164436a0b59c8be7e46401 at edutech.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi sir,
>
> Try blocking the IP address that is hitting your server.
>
> Best Regards
> Nirmit Krishnatray | Associate Manager - Professional Services
> DBS Business Center, World Trade Tower, Barakhamba Lane,Connaught Place,
> New Delhi – 110001
> M: +91 9003078515 | E:nirmit at edutech.com
> Edutech India  | LinkedIn  | Twitter  |  Facebook  |  Youtube
>
>
> ------------------------------
>
> Message: 3
> Date: Wed, 10 Jan 2024 13:23:31 +0530
> From: vinod mishra<mishravk79 at gmail.com>
> To: Nirmit Krishnatray<nirmit at edutech.com>
> Cc: Koha<Koha at lists.katipo.co.nz>
> Subject: Re: [Koha] how to avoid high cpu uses due to web crawlers
> Message-ID:
> 	<CAGLUwiTpOGzFrz84VAX5BZvSa7ejZ+riVFmtDEVuy1c-QTGU0A at mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> Thanks, that is the ultimate solution. I am also looking for any other
> effective solution in case I face this again in future. Creating a
> robots.txt file seems easy, but identifying which crawler is hitting the
> server from its IP address alone is difficult.
>
> ------------------------------
>
> Message: 4
> Date: Wed, 10 Jan 2024 10:19:55 +0100 (CET)
> From: "Wagner, Alexander"<alexander.wagner at desy.de>
> To: vinod mishra<mishravk79 at gmail.com>
> Cc: Koha<Koha at lists.katipo.co.nz>
> Subject: Re: [Koha] how to avoid high cpu uses due to web crawlers
> Message-ID:<2002597982.8686470.1704878395276.JavaMail.zimbra at desy.de>
> Content-Type: text/plain; charset=utf-8
>
> Hi!
>
>> I found that the IP 47.76.35.19 is hitting my OPAC continuously. As a
>> result, CPU usage is very high, which makes the entire Koha OPAC and
>> staff client very slow.
> This does not look like a legitimate crawler, so you most likely can't tackle it with a robots.txt, as it will probably not respect it anyway.
>
>> I am also not able to locate the .htaccess file on my Ubuntu 18.04
>> server with Koha 20.04.
>> Can anyone tell me how to resolve this?
> `.htaccess` files do not exist by default; you would have to create one in the appropriate place, with proper permissions and ownership, using your favourite text editor. They are per-directory configuration files read by your web server (where `AllowOverride` permits it) and can act like folder-based firewall rules. In other words, you could either use one of those or put a rule in your Apache configs.
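>
> A minimal sketch of the `.htaccess` route, assuming Apache 2.4 and that the enclosing `<Directory>` block permits it via `AllowOverride AuthConfig` (or `All`). The file would sit in the directory the OPAC is served from, and the address is the one reported above:
>
> ```
> # .htaccess (sketch): let everyone in except one address
> <RequireAll>
>     Require all granted
>     Require not ip 47.76.35.19
> </RequireAll>
> ```
>
> The same lines also work inside a `<Directory>` or `<Location>` block in the main configuration, which avoids the per-request cost of `.htaccess` lookups.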
>
> I am no expert in either, but on one of our current (non-Koha) systems we use something like
>
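> ```
> # Turn bad IPs away: look up the client address in a plain-text map
> RewriteMap hosts-deny "txt:/opt/invenio/var/tmp/hosts-deny.txt"
> # Match if the connecting address or the X-Forwarded-For header is listed
> RewriteCond   "${hosts-deny:%{REMOTE_ADDR}|NOT-FOUND}" "!=NOT-FOUND" [OR]
> RewriteCond   "${hosts-deny:%{HTTP:X-Forwarded-For}|NOT-FOUND}" "!=NOT-FOUND"
> # Answer listed clients with 429 Too Many Requests
> RewriteRule .* - [R=429,L]
> ```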
>
> in the Apache configs. This refers to a text file called `hosts-deny.txt`, in this case in the somewhat odd path `/opt/invenio/var/tmp/`, that lists the IP addresses that should be turned away. You could in principle create such a file somewhere your Apache can read it. This makes it a bit easier to handle unwanted "crawlers", as you just add the offending IPs there.
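>
> For illustration, a sketch of what such a map file might look like. With the `txt:` map type each line is a key followed by a value, separated by whitespace; the rules above only test whether the key is found, so the value (`deny` here) is an arbitrary placeholder:
>
> ```
> # hosts-deny.txt (sketch): one address per line, followed by any value
> 47.76.35.19    deny
> ```
>
> Two caveats: `RewriteMap` is only allowed in the server or virtual host configuration, not in `.htaccess`, and the rules need `RewriteEngine On` in the same context. Apache re-reads a `txt:` map when the file's modification time changes, so adding an address should not need a restart.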
>
> HTH.
>

