[Koha] Koha slowed down by Google indexing?!
Michael Kuhn
mik at adminkuhn.ch
Thu May 4 02:14:55 NZST 2017
Hi Mark and Hugo
Many thanks for your hints! I have now done the following.
1. I created a file "/usr/share/koha/opac/htdocs/robots.txt" containing
this:
Sitemap: sitemapindex.xml
User-agent: *
Disallow: /cgi-bin/
2. I generated a Koha sitemap using the seemingly undocumented Perl
script "sitemap.pl" (according to
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190) which
created the file "/usr/share/koha/opac/htdocs/sitemapindex.xml" and the
file "/usr/share/koha/opac/htdocs/sitemap0001.xml" containing the URLs.
3. Even after a complete reboot of the host the "opac-search.pl"
processes were still there, appearing immediately after the reboot!
4. I went to Google Webmaster Tools where I downloaded the HTML
confirmation file "googleb56bd3db2af352b1.html" and placed it in
"/usr/share/koha/opac/htdocs" as well. I also followed the steps given
on the Webmaster Tools page, i.e. I called the URL and confirmed the
download.
5. Even after a complete reboot of the host the "opac-search.pl"
processes were still there, appearing immediately after the reboot!
6. I then installed the Uncomplicated Firewall / UFW where I applied the
following rules and enabled it:
# ufw status
Status: active
To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
80/tcp                     ALLOW       Anywhere
8080/tcp                   ALLOW       Anywhere
Anywhere                   DENY        66.249.64.32
However this is possible, Googlebot is still crawling and eating
CPU! This can be seen in the log file "plack.log", which contains
hundreds and thousands of lines like the following:
66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET
/opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
And I also found another bot:
62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET
/opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-"
"Linguee Bot (http://www.linguee.com/bot; bot at linguee.com)"
Now what I don't understand is how Googlebot (66.249.64.32) can access
the webserver even though it is blocked by UFW?!
9. Already quite desperate, I finally executed the following lines to
drop all packets from 66.249.64.32 and 62.138.14.218:
# iptables -I INPUT -s 66.249.64.32 -j DROP
# iptables -I INPUT -s 62.138.14.218 -j DROP
And yes - this actually stopped these harassing bots.
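A possible explanation for the UFW puzzle in step 6, offered as an assumption and not verified on this host: packet filter rules are checked in order and the first match decides, so a DENY listed after "80/tcp ALLOW Anywhere" would never be reached for web traffic, whereas "iptables -I" puts the new rule at the top of the chain. The following Python sketch is only a toy model of that first-match behaviour. The rule tuples and the function name are made up for illustration, and it does not talk to UFW or iptables.

```python
# Toy model of first-match rule evaluation; not a real firewall interface.
# Each rule is (port, source IP, action); None means "any port" / "anywhere".

def decide(rules, src_ip, port, default="DENY"):
    """Return the action of the first rule that matches the packet."""
    for rule_port, rule_src, action in rules:
        if rule_port not in (None, port):
            continue  # rule is for a different port
        if rule_src not in (None, src_ip):
            continue  # rule is for a different source address
        return action
    return default  # no rule matched

# Order as shown by "ufw status" above: the DENY comes last.
appended = [
    (22, None, "ALLOW"),
    (80, None, "ALLOW"),
    (8080, None, "ALLOW"),
    (None, "66.249.64.32", "DENY"),
]

# Same rules with the DENY moved to the front, as "iptables -I INPUT" does.
inserted_first = [appended[-1]] + appended[:-1]

print(decide(appended, "66.249.64.32", 80))        # ALLOW
print(decide(inserted_first, "66.249.64.32", 80))  # DENY
```

If this assumption holds, a command such as "ufw insert 1 deny from 66.249.64.32" would place the DENY before the ALLOW rules instead of after them.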
But of course, the next Googlebot address soon showed up in the log:
66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET
/opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I also dropped this IP address and now - finally! - the OPAC search for
the normal user works as fast as expected.
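Since dropping one address at a time means chasing every new crawler IP, it may help to see which clients and user agents produce the most requests. Here is a minimal Python sketch, assuming "plack.log" uses the combined log format seen in the lines quoted above. The regular expression and the function name are my own, and the sample entries are the three log lines from this mail, each joined onto one line.

```python
import re
from collections import Counter

# Combined log format: IP, identity, user, [time], "request", status, size,
# "referer", "user agent". The pattern is an assumption based on the samples.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

def tally(lines):
    """Count requests per (client IP, user agent) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # silently skip lines that do not fit the pattern
            counts[(match["ip"], match["agent"])] += 1
    return counts

GOOGLEBOT = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
             "+http://www.google.com/bot.html)")

sample = [
    '66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET '
    '/opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-" '
    '"' + GOOGLEBOT + '"',
    '62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET '
    '/opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-" '
    '"Linguee Bot (http://www.linguee.com/bot; bot at linguee.com)"',
    '66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET '
    '/opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-" '
    '"' + GOOGLEBOT + '"',
]

# Print the busiest (IP, user agent) pairs first.
for (ip, agent), hits in tally(sample).most_common():
    print(hits, ip, agent)
```

On the real log, the same function can be fed an open file handle, for example "with open('plack.log') as fh: tally(fh)", with the path adjusted to wherever the log lives.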
In fact I can't believe I am the only one experiencing this behavior
(especially since the information about creating a sitemap with
"sitemap.pl" is quite hidden and, as far as I can see, undocumented in
the Koha manual).
The other thing is that people usually say it's a good thing to be
indexed by Google. Today, however, I don't agree. Maybe tomorrow I will
try to delete the rules which drop the Google packets, and I really hope
Google will then do what it is told in "robots.txt", using the Koha sitemap.
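Whether the "robots.txt" from step 1 actually covers the URLs the bots requested can be checked offline. Here is a small Python sketch using the standard library's urllib.robotparser, assuming the OPAC is served under the "/opac/" paths seen in "plack.log". The variable names are mine, the first two paths come from the log lines above, and the third path is hypothetical, added only for comparison.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from step 1, unchanged.
ROBOTS_TXT = """\
Sitemap: sitemapindex.xml
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

paths = [
    # Taken from the plack.log lines quoted above.
    "/opac/opac-authoritiesdetail.pl?authid=12872",
    "/opac/opac-search.pl?q=se,phr:%22Zeitreise%22",
    # Hypothetical path below /cgi-bin/, for comparison only.
    "/cgi-bin/koha/opac-search.pl?q=test",
]

# Report for each path whether a well-behaved crawler may fetch it.
for path in paths:
    allowed = parser.can_fetch("Googlebot", path)
    print("allowed" if allowed else "blocked", path)
```

If the two "/opac/" paths come out as allowed, the Disallow line does not keep a well-behaved crawler away from the search and detail pages, and a Disallow line for the "/opac/opac-search.pl" path may be needed, which is a guess about this setup. The sitemap protocol also expects a complete URL after "Sitemap:", not just a file name.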
So all this just for the record - maybe it will help someone in the future.
Best wishes: Michael
--
Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
T 0041 (0)61 261 55 61 · E mik at adminkuhn.ch · W www.adminkuhn.ch