[Koha] Koha slowed down by Google indexing?!

Hugo Agud hagud at orex.es
Thu May 4 02:20:43 NZST 2017


Hi

You're not the only one who has suffered this from Google; Baidu is worse,
and so are some others. Here are some telegram-style answers to your points...

Yes, I have also suffered a lot from crawlers, and I have spent a lot of
hours trying to adjust firewalls, robots.txt...
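
To see which crawlers are actually hitting the server, it can help to tally
the access log by client IP and user agent. Here is a rough Python sketch,
assuming plack.log uses the combined log format seen in the log lines quoted
below; the name tally_clients is only illustrative and the sample lines are
the ones from your message:

```python
import re
from collections import Counter

# Combined log format:
# IP identity user [time] "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def tally_clients(lines):
    """Count requests per (IP, user agent); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts

# Sample lines taken from the quoted message (unwrapped).
sample = [
    '66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET /opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-" "Linguee Bot (http://www.linguee.com/bot; bot at linguee.com)"',
    '66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

# Print the busiest clients first.
for (ip, agent), hits in tally_clients(sample).most_common():
    print(hits, ip, agent)
```

For a real log you would pass the open file object instead of the sample list.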

Which version of Koha are you using? Recent ones have a koha-sitemap command
(if I am not mistaken).

Google Webmaster Tools warns you that changes do not take effect immediately,
so you should wait a little longer...


In summary, you have done all the expected work; now it is just a matter of
adjusting it and waiting for the results.
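
One adjustment that may be worth checking: the robots.txt quoted below
disallows /cgi-bin/, but the requests in plack.log go to /opac/..., so that
rule probably does not cover them. Also, as far as I know, the Sitemap line
is meant to be a full URL rather than a bare file name. A small sketch with
Python's standard urllib.robotparser shows the effect; the second rule set is
only a hypothetical alternative for illustration, not a tested recommendation,
and the helper name allowed is made up:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_lines, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, path)

# The rules from the quoted message.
current_rules = ["Sitemap: sitemapindex.xml", "User-agent: *", "Disallow: /cgi-bin/"]

# Hypothetical alternative that matches the search path seen in plack.log.
alternative_rules = ["User-agent: *", "Disallow: /opac/opac-search.pl"]

search_url = "/opac/opac-search.pl?q=se,phr:%22Zeitreise%22"

# The /cgi-bin/ rule is not a prefix of the /opac/ path, so crawling stays allowed.
print(allowed(current_rules, "Googlebot", search_url))
# With a rule that matches the real path, the same request is disallowed.
print(allowed(alternative_rules, "Googlebot", search_url))
```

Of course this only helps with crawlers that respect robots.txt at all.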

With the combination of robots.txt, koha-sitemap and a firewall I have been
happy for a long time... but you are never completely safe from this.
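
About the firewall part: as far as I know, ufw checks its rules in order and
the first match wins, so a DENY added after "80/tcp ALLOW Anywhere" is never
reached for web traffic, while "iptables -I INPUT" puts the DROP at the top of
the chain. The toy Python model below only illustrates that ordering effect.
It is not how ufw or iptables are implemented, and the rule tuples, the
default policy and the name first_match are assumptions made for the
illustration:

```python
from ipaddress import ip_address, ip_network

ANY = ip_network("0.0.0.0/0")

def first_match(rules, source, port):
    """Return the action of the first rule matching source and port."""
    for action, network, rule_port in rules:
        # A rule_port of None stands for "any port".
        if ip_address(source) in network and rule_port in (None, port):
            return action
    return "DENY"  # assumed default policy for unmatched incoming traffic

googlebot = ip_network("66.249.64.32/32")

# Order as in the quoted "ufw status" output: the DENY comes last.
appended = [
    ("ALLOW", ANY, 22),
    ("ALLOW", ANY, 80),
    ("ALLOW", ANY, 8080),
    ("DENY", googlebot, None),
]

# Same rules with the DENY moved to the front, comparable to "iptables -I INPUT".
inserted = [appended[-1]] + appended[:-1]

print(first_match(appended, "66.249.64.32", 80))  # the ALLOW for port 80 matches first
print(first_match(inserted, "66.249.64.32", 80))  # the DENY matches first
```

With ufw itself, I believe "ufw insert 1 deny from <address>" is the way to
put such a rule in front of the others, but please check the manual page.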

:( I am sorry..








2017-05-03 16:14 GMT+02:00 Michael Kuhn <mik at adminkuhn.ch>:

> Hi Mark and Hugo
>
> Many thanks for your hints! I have now done the following.
>
> 1. I created a file "/usr/share/koha/opac/htdocs/robots.txt" containing
> this:
>
>  Sitemap: sitemapindex.xml
>  User-agent: *
>  Disallow: /cgi-bin/
>
> 2. I generated a Koha sitemap using the seemingly undocumented Perl script
> "sitemap.pl" (according to
> https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190) which
> created the file "/usr/share/koha/opac/htdocs/sitemapindex.xml" and the file
> "/usr/share/koha/opac/htdocs/sitemap0001.xml" containing the URLs.
>
> 3. Even after a complete reboot of the host the "opac-search.pl"
> processes were still there, appearing immediately after the reboot!
>
> 4. I went to Google Webmaster Tools where I downloaded the HTML
> confirmation file "googleb56bd3db2af352b1.html" and placed it in
> "/usr/share/koha/opac/htdocs" as well. I also followed the steps given on
> the Webmaster Tools page, i.e. I called the URL and I confirmed the
> download.
>
> 5. Even after a complete reboot of the host the "opac-search.pl"
> processes were still there, appearing immediately after the reboot!
>
> 6. I then installed the Uncomplicated Firewall / UFW where I applied the
> following rules and enabled it:
>
>  # ufw status
>  Status: active
>
>  To                         Action      From
>  --                         ------      ----
>  22/tcp                     ALLOW       Anywhere
>  80/tcp                     ALLOW       Anywhere
>  8080/tcp                   ALLOW       Anywhere
>  Anywhere                   DENY        66.249.64.32
>
> However this is possible, Googlebot is still crawling and eating CPU!
> This can be seen in the log file "plack.log", which contains hundreds and
> thousands of lines like the following:
>
>  66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
>
> And I also found another bot:
>
>  62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET /opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-" "Linguee Bot (http://www.linguee.com/bot; bot at linguee.com)"
>
> Now what I don't understand is how Googlebot (66.249.64.32) can access
> the webserver even if it is blocked by UFW?!
>
> 9. Already quite desperate, I finally executed the following lines to drop
> all packets from 66.249.64.32 and 62.138.14.218.
>
>  # iptables -I INPUT -s 66.249.64.32 -j DROP
>  # iptables -I INPUT -s 62.138.14.218 -j DROP
>
> And yes - this actually stopped these harassing bots.
>
> But of course, next was this:
>
>  66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
>
> I also dropped this IP address and now - finally! - the OPAC search for
> the normal user works as fast as expected.
>
> In fact I can't believe I should be the only one experiencing this
> behavior (especially since the information about "sitemap.pl" is quite
> hidden and undocumented in the Koha manual).
>
> The other thing is people usually say it's a good thing to be indexed by
> Google. Today however, I won't agree. Maybe tomorrow, I will then try to
> delete the rule which drops the Google packets and I really hope Google
> will then do what it is told to do in "robots.txt", using the Koha sitemap.
>
> So all this just for the record - maybe it will help someone in the future.
>
> Best wishes: Michael
> --
> Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
> Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
> T 0041 (0)61 261 55 61 · E mik at adminkuhn.ch · W www.adminkuhn.ch
>



-- 

*Hugo Agud - Orex Digital *

*www.orex.es <http://www.orex.es>*




Director

Calle Sant Joaquin,117, 2º-3ª · 08922 Santa Coloma de Gramanet - Tel: 933
856 138   hagud at orex.es · http://www.orex.es/




