Re: [Koha] OPAC searches fail with ICU indexing enabled
Hi Andreas,

This problem looks a little familiar. I have a few questions. You find 335 records using yaz-client. Are you able to view those records using "show" in yaz-client? Also, where are you seeing the following error:
Error: :8: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xCF 0x3C 0x2F 0x74 ος Τιμόθεον Α΄-Ï€Ïος Τιμόθεον Β΄-Ï€Ïος Τίτον-Ï€ ^
Is that in a file in your /var/log/koha/imp directory?

Also, those instructions at https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records look a bit suboptimal... Are you using packages? Did you run the following?

sudo koha-restart-zebra {yourinstance}
sudo koha-rebuild-zebra -f {yourinstance}

That parser error doesn't look super helpful... In Windows-1252, 0xCF is Ï, 0x3C is <, and 0x2F is /. In UTF-8, χ is 0xCF 0x87 and ό is 0xCF 0x8C, so the lone 0xCF followed by "</t" could be the first byte of a Greek character that was cut off mid-sequence. If I had to guess, I'd say that Zebra thinks it's using ICU and UTF-8 but the data is still stored as Latin-1. Failing that... I have some other more in-depth troubleshooting ideas.

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Office: 02 9212 0899
Direct: 02 8005 0595
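[Editor's note: the byte arithmetic in the message above can be checked directly. A quick Python sketch, not part of the original thread, showing that lowercase χ and ό both begin with the UTF-8 lead byte 0xCF, and why 0xCF immediately followed by "<" triggers the parser error:]

```python
# UTF-8 encodings of the lowercase Greek characters in the failing query "[χ.ό.]":
chi = "\u03c7".encode("utf-8")            # χ
omicron_tonos = "\u03cc".encode("utf-8")  # ό
print(chi.hex(), omicron_tonos.hex())     # cf87 cf8c

# The bytes from the parser error were 0xCF 0x3C 0x2F 0x74, i.e. 0xCF then "</t".
# 0xCF is a UTF-8 lead byte that must be followed by a continuation byte,
# so the "<" (0x3C) right after it is exactly what an XML parser rejects.
try:
    b"\xcf</t".decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err.reason)
```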
-----Original Message-----
Message: 7
Date: Wed, 11 May 2016 18:12:51 +0300
From: Andreas Roussos <arouss1980@gmail.com>
To: koha@lists.katipo.co.nz
Subject: [Koha] OPAC searches fail with ICU indexing enabled
Message-ID: <CAK0RUrtVcZZ0jOqgmvPxrcXWw4g_qqQ3_MD5OqHYHkz_sfdcGQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Dear list,
We're running Koha 3.20.04 on Ubuntu 14.04, and recently enabled ICU indexing as per the instructions on the wiki (https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records).
Most searches work fine, but queries for certain Greek characters in the OPAC (for example, "[χ.ό.]") return the following message:
Error: :8: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xCF 0x3C 0x2F 0x74 ος Τιμόθεον Α΄-Ï€Ïος Τιμόθεον Β΄-Ï€Ïος Τίτον-Ï€ ^
If I use the command-line zebra client to perform the same search, I get 335 hits:
$ yaz-client -c /etc/koha/zebradb/ccl.properties unix:/var/run/koha/imp/bibliosocket
Connecting...OK.
Sent initrequest.
Connection accepted by v3 target.
ID     : 81
Name   : Zebra Information Server/GFS/YAZ
Version: 4.2.30 98864b44c654645bc16b2c54f822dc2e45a93031
Options: search present delSet triggerResourceCtrl scan sort extendedServices namedResultSets
Elapsed: 0.000743
Z> base biblios
Z> f [χ.ό.]
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 335, setno 1
SearchResult-1: term=χο cnt=335
records returned: 0
Elapsed: 0.014453
So it looks as if Zebra can actually perform the search, but somehow the results cannot be displayed in the OPAC.
Does anyone have any clues as to why this is happening?
Kind regards, Andreas
------------------------------
Hi David,

Thank you for your reply. Please see my answers inline below:

On Tue, May 17, 2016 at 4:51 AM, David Cook <dcook@prosentient.com.au> wrote:
Hi Andreas,
This problem looks a little familiar. I have a few questions.
You find 335 records using yaz-client. Are you able to view those records using "show" in yaz-client?
Yes, I can view the records using "show", or "show 1", "show 42" etc.
Also where are you seeing the following error:
Error: :8: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xCF 0x3C 0x2F 0x74 ος Τιμόθεον Α΄-Ï€Ï Î¿Ï‚ Τιμόθεον Β΄-Ï€Ï Î¿Ï‚ Τίτον-Ï€ ^
Is that in a file in your /var/log/koha/imp directory?
No, this is actually displayed in my web browser when searching in OPAC.
Also, those instructions at https://wiki.koha-community.org/wiki/Correcting_Search_of_Arabic_records look a bit suboptimal...
Are you using packages? Did you run the following?
sudo koha-restart-zebra {yourinstance} sudo koha-rebuild-zebra -f {yourinstance}
Yes, I'm using packages and I've run both zebra commands.
That parser error doesn't look super helpful... In Windows-1252, 0xCF is Ï, 0x3C is <, and 0x2F is /. In UTF-8, χ is 0xCF 0x87 and ό is 0xCF 0x8C, so the lone 0xCF followed by "</t" could be the first byte of a Greek character that was cut off mid-sequence. If I had to guess, I'd say that Zebra thinks it's using ICU and UTF-8 but the data is still stored as Latin-1.
What I find odd is that other searches in the OPAC for Greek characters work fine and return records (for example, "[α.β.]" or "[α.ό.]"). It looks as if there's something contained in the results for "[χ.ό.]" that causes the failure.

Failing that... I have some other more in-depth troubleshooting ideas.
I'd be more than happy to hear those :-)
David Cook Systems Librarian
Prosentient Systems 72/330 Wattle St Ultimo, NSW 2007
Office: 02 9212 0899 Direct: 02 8005 0595
Regards, Andreas
Hi Andreas,

Thanks for your response. It’s very helpful!

As you say, the problem probably is in individual records in the results for "[χ.ό.]". Do you see the error in the OPAC immediately after searching, or is it on a certain page of results? If it’s on a certain page of results, we might be able to narrow down the problem further.

When using “show” in yaz-client, are you able to view every single record? You might try using “format xml” in yaz-client if you’re not already doing so, as that might help problems to surface.

How many records are in your Koha database overall? Depending on how technical you are, you might consider trying the MARC checker plugin (https://github.com/bywatersolutions/koha-plugin-marc-checker), or writing your own script to iterate through all the MARCXML records in your database and try to create MARC::Record objects from them. If the problem is with the record itself, that’s a good way of discovering which one(s) are at fault.

If you’re having issues with results for "[χ.ό.]", it’s probably a safe assumption that there are other problems with records in the database, so scanning all the records is probably a good idea. If you have a very large database, you can break it up into chunks using biblionumber.

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Office: 02 9212 0899
Direct: 02 8005 0595
Hi David, On Wed, May 18, 2016 at 4:46 AM, David Cook <dcook@prosentient.com.au> wrote:
Hi Andreas,
Thanks for your response. It’s very helpful!
Thank _you_ for your time.
As you say, the problem probably is in individual records in the results for "[χ.ό.]". Do you see the error in the OPAC immediately after searching, or is it on a certain page of results? If it’s on a certain page of results, we might be able to narrow down the problem further.
I see the problem in the OPAC immediately after searching; no results are displayed at all.
When using “show” in yaz-client, are you able to view every single record? You might try using “format xml” in yaz-client if you’re not already doing so, as that might help problems to surface.
Yes, I enabled "format xml" and was able to view all 335 records returned for my search by typing "show" repeatedly.
How many records are in your Koha database overall? Depending on how technical you are, you might consider trying the MARC checker plugin (https://github.com/bywatersolutions/koha-plugin-marc-checker), or writing your own script to iterate through all the MARCXML records in your database and try to create MARC::Record objects from them. If the problem is with the record itself, that’s a good way of discovering which one(s) are at fault.
If you’re having issues with results for "[χ.ό.]", it’s probably a safe assumption that there are other problems with records in the database, so scanning all the records is probably a good idea. If you have a very large database, you can break it up into chunks using biblionumber.
We have approx. 22k records, but we're using UNIMARC (apologies for not mentioning this earlier). For what it's worth, I enabled the MARC checker plugin and ran a report on biblionumbers 1 to 200 (biblionumber 147 contained the string "[χ.ό.]"). This resulted in a lot of output like "245: No 245 tag." because we store the title in field 200a. So, as I understand it, this particular plugin is tailored towards MARC21 installations.

I'm not well versed in Perl, so writing my own MARC checker script would be difficult. However, I do know a little bit of C, so I've written a small program that connects to our MySQL DB and fetches the 'marcxml' field of a particular biblionumber. I then redirect the output of this program to a file and (based on http://stackoverflow.com/questions/115210/utf-8-validation) run `iconv` on the file to see if it contains any invalid UTF-8 data. No records with UTF-8 "oddities" have been found using this method :-(

BTW, will you be attending KohaCon'16 by any chance?

Regards,
Andreas
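[Editor's note: the C-plus-iconv check described above amounts to a strict UTF-8 decode of each record. A minimal Python sketch of the same idea, not from the thread; the sample records are made up to illustrate the truncated-byte case:]

```python
def find_invalid_utf8(records):
    """Given an iterable of (biblionumber, raw_bytes) pairs, return the
    biblionumbers whose MARCXML is not valid UTF-8 -- the same check
    that `iconv -f utf-8` performs on a dumped record."""
    bad = []
    for biblionumber, raw in records:
        try:
            raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            bad.append(biblionumber)
    return bad

# One good record and one truncated mid-character, leaving a lone 0xCF
# lead byte right before "<" -- like the bytes in the parser error:
sample = [
    (146, '<subfield code="a">[χ.ό.]</subfield>'.encode("utf-8")),
    (147, '<subfield code="a">[χ.ό'.encode("utf-8")[:-1] + b"</subfield>"),
]
print(find_invalid_utf8(sample))  # [147]
```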
David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Office: 02 9212 0899
Direct: 02 8005 0595
Hi Andreas,

I just recalled that the query that you send through yaz-client is probably very different from the one that Koha sends to Zebra, as Koha does lots of extra special stuff to the query in the background. So you might not be retrieving the problematic records…

Can you run “bin/maintenance/touch_all_biblios.pl”? That might reveal problematic records.

That’s interesting that the UTF-8 check passed for all the records in your database. That does seem to suggest some mangling within Zebra. If “bin/maintenance/touch_all_biblios.pl” doesn’t produce any errors or bad records, then that would be the next thing to examine in more detail, I suppose.

I had hoped to go to KohaCon16, but unfortunately I won’t be able to make it.

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Office: 02 9212 0899
Direct: 02 8005 0595
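[Editor's note: what a scan like touch_all_biblios.pl can surface, records that no longer parse as XML, can also be approximated with a plain stdlib XML parse over each stored record. A sketch, not Koha's actual code; the sample records are illustrative:]

```python
import xml.etree.ElementTree as ET

def find_unparseable(marcxml_records):
    """Parse each (biblionumber, marcxml_bytes) pair and collect the ones
    an XML parser rejects, together with the parser's reason."""
    failures = []
    for biblionumber, raw in marcxml_records:
        try:
            ET.fromstring(raw)
        except ET.ParseError as err:
            failures.append((biblionumber, str(err)))
    return failures

# One well-formed record and one containing a lone 0xCF lead byte,
# mirroring the "Input is not proper UTF-8" error seen in the OPAC:
sample = [
    (146, b"<record><title>ok</title></record>"),
    (147, b"<record><title>\xcf</title></record>"),
]
print(find_unparseable(sample))  # only biblionumber 147 is flagged
```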
participants (2):
- Andreas Roussos
- David Cook