Hi

You're not the only one who has suffered this from Google, but Baidu is worse, and some others as well, giving you telegram-style answers to your points... Yes, I have also suffered a lot from crawlers, and I have spent a lot of hours trying to adjust firewalls, robots.txt...

What version of Koha are you using? Modern ones have a koha-sitemap command (if I am not wrong).

Google Webmaster Tools warns you that changes do not take effect immediately, so you should wait a little longer...

In summary, you have done all the expected work; now it is just a matter of adjusting it and waiting for the results.

With the combination of robots.txt, koha-sitemap and a firewall I have been happy for a long time... but you are never safe from this :(

I am sorry..

2017-05-03 16:14 GMT+02:00 Michael Kuhn <mik@adminkuhn.ch>:
Hi Mark and Hugo
Many thanks for your hints! I have now done the following.
1. I created a file "/usr/share/koha/opac/htdocs/robots.txt" containing this:
Sitemap: sitemapindex.xml
User-agent: *
Disallow: /cgi-bin/
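A note on the "Sitemap:" line: the sitemaps protocol expects a complete URL there, so a crawler may ignore a bare file name. Below is a minimal sketch that prints the same three directives with an absolute sitemap URL. The host name "opac.example.org" and the function name make_robots are placeholders that do not come from this thread.

```shell
#!/bin/sh
# Print a robots.txt for the OPAC with an absolute sitemap URL.
# $1 = public base URL of the OPAC (placeholder value below, an assumption).
make_robots() {
    base_url=$1
    printf 'Sitemap: %s/sitemapindex.xml\n' "$base_url"
    printf 'User-agent: *\n'
    printf 'Disallow: /cgi-bin/\n'
}

make_robots "https://opac.example.org"
```

Redirecting the output to "/usr/share/koha/opac/htdocs/robots.txt" would replace the file from step 1.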
2. I generated a Koha sitemap using the seemingly undocumented Perl script "sitemap.pl" (according to https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190) which created the file "/usr/share/koha/opac/htdocs/sitemapindex.xml" and the file "/usr/share/koha/opac/htdocs/sitemap0001.xml" containing the URLs.
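To check what the generated index actually advertises to crawlers, the <loc> entries can be listed. This is only a sketch: it assumes the standard sitemap XML layout with <loc> elements, and the function name sitemap_locs is made up for illustration.

```shell
#!/bin/sh
# Print every <loc> URL found in a sitemap or sitemap index file.
sitemap_locs() {
    grep -o '<loc>[^<]*</loc>' "$1" | sed -e 's/<loc>//' -e 's/<\/loc>//'
}

# Path taken from step 2; nothing happens if the file is not readable.
index=/usr/share/koha/opac/htdocs/sitemapindex.xml
if [ -r "$index" ]; then
    sitemap_locs "$index"
fi
```

Each URL printed for the index should point to a sitemap file that is reachable from outside, e.g. "sitemap0001.xml".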
3. Even after a complete reboot of the host the "opac-search.pl" processes were still there, appearing immediately after the reboot!
4. I went to Google Webmaster Tools where I downloaded the HTML confirmation file "googleb56bd3db2af352b1.html" and placed it in "/usr/share/koha/opac/htdocs" as well. I also followed the steps given on the Webmaster Tools page, i.e. I called the URL and I confirmed the download.
5. Even after a complete reboot of the host the "opac-search.pl" processes were still there, appearing immediately after the reboot!
6. I then installed the Uncomplicated Firewall / UFW where I applied the following rules and enabled it:
# ufw status
Status: active

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
80/tcp                     ALLOW       Anywhere
8080/tcp                   ALLOW       Anywhere
Anywhere                   DENY        66.249.64.32
However this is possible, Googlebot is still crawling and eating CPU! This can be seen in the log file "plack.log", which contains hundreds or thousands of lines like the following:
66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
And I also found another bot:
62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET /opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-" "Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)"
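Instead of spotting bots by eye, the busiest clients can be counted straight from the log. The sketch below assumes the log lines have the combined format shown above (client IP first, user agent as the last quoted field). The function name top_clients and the default file name are placeholders.

```shell
#!/bin/sh
# Count requests per (client IP, user agent) pair, busiest first.
# $1 = log file, $2 = number of rows to show (default 10).
top_clients() {
    awk -F'"' '{ split($1, a, " "); print a[1] "\t" $6 }' "$1" \
        | sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Placeholder path: pass the real location of plack.log as first argument.
log=${1:-plack.log}
if [ -r "$log" ]; then
    top_clients "$log" 10
fi
```

Each output row starts with the request count, followed by the IP address and the user agent string.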
Now what I don't understand is how Googlebot (66.249.64.32) can access the webserver even though it is blocked by UFW?!
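A likely explanation, offered as an assumption rather than a confirmed diagnosis: ufw evaluates rules in order and the first match wins, so the "80/tcp ALLOW Anywhere" rule accepts Googlebot before the DENY rule at the end of the list is reached. Inserting the deny rule at position 1 puts it ahead of the ALLOW rules. The sketch below only prints the commands unless APPLY=1 is set; the helper names run and block_first are made up for illustration.

```shell
#!/bin/sh
# Dry run by default: print each command; execute it only when APPLY=1 (as root).
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

# Insert a deny rule at position 1 so it is checked before "80/tcp ALLOW".
block_first() {
    for ip in "$@"; do
        run ufw insert 1 deny from "$ip"
    done
}

block_first 66.249.64.32 62.138.14.218
```

Afterwards "ufw status numbered" shows the rule order, and the DENY lines should appear above the ALLOW lines.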
9. Already quite desperate, I finally executed the following lines to drop all packets from 66.249.64.32 and 62.138.14.218.
# iptables -I INPUT -s 66.249.64.32 -j DROP
# iptables -I INPUT -s 62.138.14.218 -j DROP
And yes - this actually stopped these harassing bots.
But of course, next was this:
66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I also dropped this IP address and now - finally! - the OPAC search for the normal user works as fast as expected.
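Since the crawler comes back from a neighbouring address, dropping single IPs tends to become a chase. As a sketch, the distinct addresses that sent a given user agent can be collected from the log in one go, again assuming the combined log format shown above. The function name agent_ips and the default file name are placeholders.

```shell
#!/bin/sh
# List the distinct client IPs whose User-Agent contains a given substring.
# $1 = log file, $2 = substring to look for, e.g. Googlebot
agent_ips() {
    awk -F'"' -v ua="$2" 'index($6, ua) { split($1, a, " "); print a[1] }' "$1" \
        | sort -u
}

# Placeholder path: pass the real location of plack.log as first argument.
log=${1:-plack.log}
if [ -r "$log" ]; then
    agent_ips "$log" Googlebot
fi
```

Note that a user agent string can be faked, so the resulting list is a hint and not proof that the requests come from Google.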
In fact, I can't believe I am the only one experiencing this behavior (especially since the information about creating a sitemap with "sitemap.pl" is quite hidden and not documented in the Koha manual).
The other thing is that people usually say it's a good thing to be indexed by Google. Today, however, I don't agree. Maybe tomorrow I will try to delete the rule which drops the Google packets, and I really hope Google will then do what it is told to do in "robots.txt", using the Koha sitemap.
So all this just for the record - maybe it will help someone in the future.
Best wishes: Michael
--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik@adminkuhn.ch · W www.adminkuhn.ch
--
Hugo Agud - Orex Digital
hagud@orex.es · http://www.orex.es/