Hi

We're on Debian GNU/Linux 8 running Koha 16.11.04. Today the library complained that Koha was acting very slow - a search in the OPAC takes about 10 to 30 seconds. With the command "top" the following can be seen:

top - 13:06:31 up 2:44, 3 users, load average: 3.14, 2.54, 2.41
Tasks: 162 total, 5 running, 157 sleeping, 0 stopped, 0 zombie
%Cpu0 : 95.7 us, 4.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 95.7 us, 4.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem: 4049620 total, 2377480 used, 1672140 free, 218872 buffers
KiB Swap: 4189180 total, 6632 used, 4182548 free. 813584 cached Mem

PID   USER     PR NI VIRT   RES    SHR   S %CPU %MEM TIME+   COMMAND
19678 phsh-ko+ 20 0  224672 83616  8996  R 43.2 2.1  0:01.59 opac-searc+
19672 phsh-ko+ 20 0  393696 148372 12304 R 41.2 3.7  0:03.09 opac-searc+
19682 phsh-ko+ 20 0  211028 69912  8860  R 28.9 1.7  0:00.87 opac-searc+
19684 phsh-ko+ 20 0  168824 42320  8496  R 14.3 1.0  0:00.43 opac-searc+
...

Most of the time "opac-search.pl" is eating up 95-100% of the CPU, and more than one such process can be seen at the same time. Eventually I found many lines like the following in the file "opac-error.log":

[Wed May 03 13:13:10.796794 2017] [cgi:error] [pid 20053] [client 66.249.64.32:39077] AH01215: [Wed May 3 13:13:10 2017] opac-detail.pl: Use of uninitialized value in subroutine entry at /usr/share/perl5/URI/Escape.pm line 184.
[Wed May 03 13:13:17.600461 2017] [cgi:error] [pid 20479] [client 66.249.64.32:50544] AH01215: [Wed May 3 13:13:17 2017] opac-ISBDdetail.pl: Use of uninitialized value in subroutine entry at /usr/share/perl5/URI/Escape.pm line 184.

When I looked up who 66.249.64.32 is, I saw this IP address belongs to Google. When I added some functionality to show the execution time of "opac-search.pl" I got many more lines like the following, most of them pointing to IP address 66.249.64.32:

[Wed May 03 13:10:01.030216 2017] [cgi:error] [pid 20053] [client 66.249.64.32:39077] AH01215: START: 1493809799.15058 / END: 1493809801.02982 = 1.87923502922058
[Wed May 03 13:10:04.426503 2017] [cgi:error] [pid 20147] [client 104.154.58.95:33362] AH01215: START: 1493809799.83011 / END: 1493809804.42624 = 4.59613108634949
[Wed May 03 13:10:06.473646 2017] [cgi:error] [pid 20053] [client 66.249.64.32:39077] AH01215: START: 1493809805.1977 / END: 1493809806.47357 = 1.27586483955383

I'm not sure what this means. Is Google indexing our library catalog and thus slowing everything down? Has anyone experienced similar behaviour, or can someone please explain what's happening?

Best wishes: Michael
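P.S. A quick way to see which client is generating the load is to count requests per IP in the web server's access log. This is just a sketch - replace the path with wherever your Apache or Plack access log actually lives on your system:

awk '{print $1}' /path/to/opac-access.log | sort | uniq -c | sort -rn | head

The busiest IP addresses then show up at the top of the list.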
When I looked up who 66.249.64.32 is, I saw this IP address belongs to Google.
This does seem to be the Google indexer:

% nslookup 66.249.64.32
...
32.64.249.66.in-addr.arpa   name = crawl-66-249-64-32.googlebot.com.

I haven't seen this problem (yet), but perhaps that is because I have a /usr/share/koha/opac/htdocs/robots.txt containing this:

Crawl-delay: 60

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/koha/opac-search.pl
Disallow: /cgi-bin/koha/opac-showmarc.pl
Disallow: /cgi-bin/koha/opac-detailprint.pl
Disallow: /cgi-bin/koha/opac-ISBDdetail.pl
Disallow: /cgi-bin/koha/opac-MARCdetail.pl
Disallow: /cgi-bin/koha/opac-reserve.pl
Disallow: /cgi-bin/koha/opac-export.pl
Disallow: /cgi-bin/koha/opac-detail.pl
Disallow: /cgi-bin/koha/opac-authoritiesdetail.pl
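Whether the file is actually being served to crawlers can be checked with something like this (the hostname is just a placeholder for your own OPAC):

% curl -s http://opac.example.org/robots.txt

It has to come back with exactly the rules above, otherwise the bots never see them.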
Hi

Yes, this is an annoying issue with bots - this one is Google, but there are plenty of others... You should use robots.txt properly, but if I am not wrong, with Google it is more effective to go to the Google Webmaster Tools site and adjust Googlebot's behaviour towards your Koha installation.

You should also use a Koha sitemap - depending on the version this is out-of-the-box functionality.

Perhaps you may also think about using ufw, or even ufw + fail2ban (a rough sketch follows in the P.S. below).

Sometimes bots are a nightmare.

2017-05-03 13:49 GMT+02:00 Mark Alexander <marka@pobox.com>:
When I looked up who 66.249.64.32 is, I saw this IP address belongs to Google.
This does seem to be the Google indexer:
% nslookup 66.249.64.32
...
32.64.249.66.in-addr.arpa   name = crawl-66-249-64-32.googlebot.com.
I haven't seen this problem (yet), but perhaps that is because I have a /usr/share/koha/opac/htdocs/robots.txt containing this:
Crawl-delay: 60
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/koha/opac-search.pl
Disallow: /cgi-bin/koha/opac-showmarc.pl
Disallow: /cgi-bin/koha/opac-detailprint.pl
Disallow: /cgi-bin/koha/opac-ISBDdetail.pl
Disallow: /cgi-bin/koha/opac-MARCdetail.pl
Disallow: /cgi-bin/koha/opac-reserve.pl
Disallow: /cgi-bin/koha/opac-export.pl
Disallow: /cgi-bin/koha/opac-detail.pl
Disallow: /cgi-bin/koha/opac-authoritiesdetail.pl
--
Hugo Agud - Orex Digital
www.orex.es

Director
Calle Sant Joaquin, 117, 2º-3ª · 08922 Santa Coloma de Gramanet - Tel: 933 856 138
hagud@orex.es · http://www.orex.es/
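P.S. If you want to try the ufw + fail2ban route, a minimal jail could look roughly like this. To be clear, this is only a sketch: the jail name, filter regex, log path and limits are my own assumptions and have to be adapted to your Apache/Plack access logs - nothing here is shipped by Koha or fail2ban:

/etc/fail2ban/jail.local:

[koha-bots]
enabled  = true
port     = http,https
filter   = koha-bots
logpath  = /var/log/koha/*/opac-access.log
maxretry = 100
findtime = 60
bantime  = 3600

/etc/fail2ban/filter.d/koha-bots.conf:

[Definition]
failregex = ^<HOST> .*opac-search\.pl
ignoreregex =

This would ban, for one hour, any client that fires more than 100 searches within one minute.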
Hi Mark and Hugo

Many thanks for your hints! I have now done the following.

1. I created a file "/usr/share/koha/opac/htdocs/robots.txt" containing this:

Sitemap: sitemapindex.xml
User-agent: *
Disallow: /cgi-bin/

2. I generated a Koha sitemap using the seemingly undocumented Perl script "sitemap.pl" (according to https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190) which created the file "/usr/share/koha/opac/htdocs/sitemapindex.xml" and the file "/usr/share/koha/opac/htdocs/sitemap0001.xml" containing the URLs.

3. Even after a complete reboot of the host the "opac-search.pl" processes were still there, appearing immediately after the reboot!

4. I went to Google Webmaster Tools where I downloaded the HTML confirmation file "googleb56bd3db2af352b1.html" and placed it in "/usr/share/koha/opac/htdocs" as well. I also followed the steps given on the Webmaster Tools page, i.e. I called the URL and I confirmed the download.

5. Even after a complete reboot of the host the "opac-search.pl" processes were still there, appearing immediately after the reboot!

6. I then installed the Uncomplicated Firewall / UFW where I applied the following rules and enabled it:

# ufw status
Status: active

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
80/tcp                     ALLOW       Anywhere
8080/tcp                   ALLOW       Anywhere
Anywhere                   DENY        66.249.64.32

But - however this is possible - Googlebot is still crawling and eating CPU! This can be seen in the log file "plack.log", where hundreds and thousands of lines like the following appear:

66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

And I also found another bot:

62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET /opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-" "Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)"

Now what I don't understand is how Googlebot (66.249.64.32) can access the webserver even if it is blocked by UFW?!

7. Already quite desperate, I finally executed the following lines to drop all packets from these addresses:

# iptables -I INPUT -s 66.249.64.32 -j DROP
# iptables -I INPUT -s 62.138.14.218 -j DROP

And yes - this actually stopped these harassing bots. But of course, next was this:

66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

I also dropped this IP address and now - finally! - the OPAC search for the normal user works as fast as expected.

In fact I can't believe I am the only one experiencing this behavior (especially since the part about creating a sitemap with "sitemap.pl" is quite hidden and not documented in the Koha manual at all). The other thing is that people usually say it's a good thing to be indexed by Google. Today, however, I won't agree. Maybe tomorrow I will try to delete the rule which drops the Google packets, and I really hope Google will then do what it is told to do in "robots.txt", using the Koha sitemap.

So all this just for the record - maybe it will help someone in the future.

Best wishes: Michael

--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch
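P.S. One caveat for anyone copying the iptables lines above: rules added with "iptables -I" only live in the running kernel and are gone after the next reboot. On Debian 8 they can be made persistent with the "iptables-persistent" package, roughly like this (a sketch, assuming that package):

# apt-get install iptables-persistent
# iptables-save > /etc/iptables/rules.v4

Also, as far as I know the "Sitemap:" line in robots.txt is supposed to contain the full URL of the sitemap (e.g. http://opac.example.org/sitemapindex.xml), not just the file name.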
Hi

You're not the only one who has suffered this from Google - Baidu is worse, and some others as well. Giving you telegram-style answers to your points...

Yes, I have also suffered a lot from crawlers, and I have spent a lot of hours trying to adjust firewalls, robots.txt and so on.

What version of Koha are you using? Modern ones have a command "koha-sitemap" (if I am not wrong).

Google Webmaster Tools warns you that changes do not have immediate effect - you should wait a little longer...

In summary, you have done all the expected work; now it is just a matter of adjusting it and waiting for the results.

With the combination of robots.txt, koha-sitemap and a firewall I have been happy for a long time... but you're never safe from this :( I am sorry.

2017-05-03 16:14 GMT+02:00 Michael Kuhn <mik@adminkuhn.ch>:
Hi Mark and Hugo
Many thanks for your hints! I have now done the following.
1. I created a file "/usr/share/koha/opac/htdocs/robots.txt" containing this:
Sitemap: sitemapindex.xml
User-agent: *
Disallow: /cgi-bin/
2. I generated a Koha sitemap using the seemingly undocumented Perl script "sitemap.pl" (according to https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190) which created the file "/usr/share/koha/opac/htdocs/sitemapindex.xml" and the file "/usr/share/koha/opac/htdocs/sitemap0001.xml" containing the URLs.
3. Even after a complete reboot of the host the "opac-search.pl" processes were still there, appearing immediately after the reboot!
4. I went to Google Webmaster Tools where I downloaded the HTML confirmation file "googleb56bd3db2af352b1.html" and placed it in "/usr/share/koha/opac/htdocs" as well. I also followed the steps given on the Webmaster Tools page, i.e. I called the URL and I confirmed the download.
5. Even after a complete reboot of the host the "opac-search.pl" processes were still there, appearing immediately after the reboot!
6. I then installed the Uncomplicated Firewall / UFW where I applied the following rules and enabled it:
# ufw status
Status: active
To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
80/tcp                     ALLOW       Anywhere
8080/tcp                   ALLOW       Anywhere
Anywhere                   DENY        66.249.64.32
But - however this is possible - Googlebot is still crawling and eating CPU! This can be seen in the log file "plack.log", where hundreds and thousands of lines like the following appear:
66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
And I also found another bot:
62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET /opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-" "Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)"
Now what I don't understand is how Googlebot (66.249.64.32) can access the webserver even if it is blocked by UFW?!
7. Already quite desperate, I finally executed the following lines to drop all packets from these addresses:
# iptables -I INPUT -s 66.249.64.32 -j DROP
# iptables -I INPUT -s 62.138.14.218 -j DROP
And yes - this actually stopped these harassing bots.
But of course, next was this:
66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I also dropped this IP address and now - finally! - the OPAC search for the normal user works as fast as expected.
In fact I can't believe I am the only one experiencing this behavior (especially since the part about creating a sitemap with "sitemap.pl" is quite hidden and not documented in the Koha manual at all).
The other thing is that people usually say it's a good thing to be indexed by Google. Today, however, I won't agree. Maybe tomorrow I will try to delete the rule which drops the Google packets, and I really hope Google will then do what it is told to do in "robots.txt", using the Koha sitemap.
So all this just for the record - maybe it will help someone in the future.
Best wishes: Michael

--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch
--
Hugo Agud - Orex Digital
www.orex.es

Director
Calle Sant Joaquin, 117, 2º-3ª · 08922 Santa Coloma de Gramanet - Tel: 933 856 138
hagud@orex.es · http://www.orex.es/
Hi Hugo
You're not the only one who has suffered this from Google - Baidu is worse, and some others as well. Giving you telegram-style answers to your points...
Yes, I have also suffered a lot from crawlers, and I have spent a lot of hours trying to adjust firewalls, robots.txt and so on.
What version of Koha are you using? Modern ones have a command "koha-sitemap" (if I am not wrong).
I'm on Koha 16.11.04, and yes, there is a command "koha-sitemap". Unfortunately I couldn't find anything about it in the Koha 16.11 manual. The only sources I found were:

* https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190
* http://search.cpan.org/~fredericd/Koha-Contrib-Tamil-0.011/bin/koha-sitemap

Since I couldn't find the command in https://wiki.koha-community.org/wiki/Commands_provided_by_the_Debian_package... I added it there in the new section "Bot-related". But I think it'd be good to propagate this better - or even activate it by default, because who would want such behavior?
Google Webmaster Tools warns you that changes do not have immediate effect - you should wait a little longer...
In summary, you have done all the expected work; now it is just a matter of adjusting it and waiting for the results.
With the combination of robots.txt, koha-sitemap and a firewall I have been happy for a long time... but you're never safe from this
:( I am sorry..
I'm getting more patient now since I was at least able to cure the symptoms...

Best wishes & thanks again: Michael

--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch
Excerpts from Michael Kuhn's message of 2017-05-03 16:14:55 +0200:
# ufw status
Status: active
To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
80/tcp                     ALLOW       Anywhere
8080/tcp                   ALLOW       Anywhere
Anywhere                   DENY        66.249.64.32
But - however this is possible - Googlebot is still crawling and eating CPU!
I haven't used UFW, but I'm looking at the documentation here:

https://help.ubuntu.com/community/UFW

and it seems that the order of the rules is important. Quote:

  Once a rule is matched the others will not be evaluated (see manual
  below) so you must put the specific rules first. As rules change you
  may need to delete old rules to ensure that new rules are put in the
  proper order.

and from the man page:

  Rule ordering is important and the first match wins. Therefore when
  adding rules, add the more specific rules first with more general
  rules later.
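So instead of appending the DENY rule after the ALLOW rules for ports 80 and 8080, it should be possible to push it to the top of the list, e.g.:

# ufw status numbered
# ufw insert 1 deny from 66.249.64.32

and then delete the old rule by its number with "ufw delete <number>". I haven't tried this myself, so treat it as a sketch based on the man page.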
Hi Mark
# ufw status
Status: active
To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
80/tcp                     ALLOW       Anywhere
8080/tcp                   ALLOW       Anywhere
Anywhere                   DENY        66.249.64.32
But - however this is possible - Googlebot is still crawling and eating CPU!
I haven't used UFW, but I'm looking at the documentation here:
https://help.ubuntu.com/community/UFW
and it seems that the order of the rules is important. Quote:
Once a rule is matched the others will not be evaluated (see manual below) so you must put the specific rules first. As rules change you may need to delete old rules to ensure that new rules are put in the proper order.
and from the man page:
Rule ordering is important and the first match wins. Therefore when adding rules, add the more specific rules first with more general rules later.
Many thanks for the clarification! Yes, this makes sense.

So I would have to delete all rules and write them again in the correct order? Very "uncomplicated" indeed ;-)

However, I have now deleted the rule for 66.249.64.32 in UFW, since the rule in iptables succeeded (without giving any special order).

Best wishes & thanks again: Michael

--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch
The sitemapper tool is baked into Koha. The packages have a handy koha-sitemap script.

Regards.

On Wed, 3 May 2017 at 11:45, Michael Kuhn (<mik@adminkuhn.ch>) wrote:
Hi Mark
# ufw status
Status: active
To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
80/tcp                     ALLOW       Anywhere
8080/tcp                   ALLOW       Anywhere
Anywhere                   DENY        66.249.64.32
But - however this is possible - Googlebot is still crawling and eating CPU!
I haven't used UFW, but I'm looking at the documentation here:
https://help.ubuntu.com/community/UFW
and it seems that the order of the rules is important. Quote:
Once a rule is matched the others will not be evaluated (see manual below) so you must put the specific rules first. As rules change you may need to delete old rules to ensure that new rules are put in the proper order.
and from the man page:
Rule ordering is important and the first match wins. Therefore when adding rules, add the more specific rules first with more general rules later.
Many thanks for the clarification! Yes, this makes sense.
So I would have to delete all rules and write them again in the correct order? Very "uncomplicated" indeed ;-)
However, I have now deleted the rule for 66.249.64.32 in UFW, since the rule in iptables succeeded (without giving any special order).
Best wishes & thanks again: Michael

--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch
--
Tomás Cohen Arazi
Theke Solutions (https://theke.io)
✆ +54 9351 3513384
GPG: B2F3C15F
On 3 May 2017 at 20:52, Tomas Cohen Arazi <tomascohen@gmail.com> wrote:
The sitemapper tool is baked into Koha. The packages have a handy koha-sitemap script.
And the documentation for it is available if you do this on the command line:

$ man koha-sitemap

Best regards,
Magnus
Libriotech
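P.S. To keep the sitemap current you will probably want to run the command regularly from cron, roughly like this ("library" is just a placeholder instance name, and the exact invocation should be double-checked against the man page on your version):

/etc/cron.d/koha-sitemap (not shipped by the package, add it yourself):

0 3 * * 0  root  koha-sitemap library

That would regenerate the files every Sunday night.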
Hi Magnus
The sitemapper tool is baked into Koha. The packages have a handy koha-sitemap script.
And the documentation for it is available if you do this on the command line:
$ man koha-sitemap
Yes - but first, of course, the world needs to know there IS such a command. That's why I wrote the following in my e-mail from 3rd May 2017, 16:38:

Unfortunately I couldn't find anything about it in the Koha 16.11 manual. The only sources I found were:

* https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190
* http://search.cpan.org/~fredericd/Koha-Contrib-Tamil-0.011/bin/koha-sitemap

Since I couldn't find the command in https://wiki.koha-community.org/wiki/Commands_provided_by_the_Debian_package... I have now added "koha-sitemap" there in the new section "Bot-related".

I still think it'd be good to propagate this better (for example mention it in the Koha manual) - or even activate it by default, because who would want such behavior?

Best wishes: Michael

--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch
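P.S. Once the sitemap is generated and referenced from "robots.txt" with its full URL, you can also tell Google about it directly instead of waiting for the next crawl - the hostname below is of course just a placeholder for your own OPAC:

$ curl "http://www.google.com/ping?sitemap=http://opac.example.org/sitemapindex.xml"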
It was mentioned in the release notes a while back. I agree that the divergence between source installs (what the manual usually talks about) and the packages has become a real problem.

On Thu, 4 May 2017 at 11:49, Michael Kuhn (<mik@adminkuhn.ch>) wrote:
Hi Magnus
The sitemapper tool is baked into Koha. The packages have a handy koha-sitemap script.
And the documentation for it is available if you do this on the command line:
$ man koha-sitemap
Yes - but first, of course, the world needs to know there IS such a command. That's why I wrote the following in my e-mail from 3rd May 2017, 16:38:
Unfortunately I couldn't find anything about it in the Koha 16.11 manual. The only sources I found were:
* https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190
* http://search.cpan.org/~fredericd/Koha-Contrib-Tamil-0.011/bin/koha-sitemap
Since I couldn't find the command in
https://wiki.koha-community.org/wiki/Commands_provided_by_the_Debian_package... I have now added "koha-sitemap" there in the new section "Bot-related".
I still think it'd be good to propagate this better (for example mention it in the Koha manual) - or even activate it by default because who would want such behavior?
Best wishes: Michael

--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch
--
Tomás Cohen Arazi
Theke Solutions (https://theke.io)
✆ +54 9351 3513384
GPG: B2F3C15F