[Koha] Identifying Records with Non-Roman Characters

Tue Jun 8 16:33:42 NZST 2021

Hi Charles,

I'm not 100% sure what you're asking here. Are you asking how to find all records whose 245 title isn't Romanized?

You could try something like this:
SELECT *
  FROM biblio
 WHERE title <> CONVERT(title USING latin1);

I tried that on one of my multilingual libraries and it gave some decent results.

However, it's worth noting that it isn't a perfect solution. Some characters have no latin1 equivalent even though they appear in perfectly "Roman" titles, so they show up as false positives. One I've noticed is the Unicode hyphen (U+2010, which is the bytes E2 80 90 in UTF-8), not to be confused with the hyphen-minus (which is ASCII and represented by the single byte 2D). Other examples are emoji and other symbols like ↔️.
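
To make the hyphen example concrete, here is a small Python sketch. Python and the helper name latin1_roundtrips are my own choices for illustration, not anything from Koha. It prints the UTF-8 bytes of the two hyphens and only roughly mimics the SQL comparison: MySQL substitutes "?" for characters it cannot convert, whereas this just tests whether the text can be encoded as latin1 at all.

```python
def latin1_roundtrips(text: str) -> bool:
    """Return True if every character in text has a latin1 equivalent.

    Hypothetical helper: an approximation of the SQL check, not the same thing.
    """
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


hyphen = "\u2010"      # HYPHEN, outside latin1
hyphen_minus = "-"     # HYPHEN-MINUS, plain ASCII

print(hyphen.encode("utf-8").hex(" "))        # e2 80 90
print(hyphen_minus.encode("utf-8").hex(" "))  # 2d
print(latin1_roundtrips("Ying-Han"))          # True
print(latin1_roundtrips("Ying\u2010Han"))     # False, so it would be flagged
```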

You could then tweak the query, or scan the results visually, to filter out anything irrelevant.
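
One possible way to do that filtering outside the database is sketched below in Python, assuming the candidate titles have already been exported as text. The function name has_non_latin_letter is hypothetical, and the rule it applies (flag a title only if it contains a letter whose Unicode name does not start with "LATIN") is a rough heuristic of mine, not something Koha provides. It ignores punctuation, digits, symbols, and emoji, so the hyphen and arrow cases above no longer count as hits.

```python
import unicodedata

# Letter categories to inspect. Modifier letters (Lm) are skipped on purpose,
# since some romanization schemes use them (for example U+02B9).
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lo"}


def has_non_latin_letter(title: str) -> bool:
    """Return True if title contains a letter from a script other than Latin."""
    for ch in title:
        if unicodedata.category(ch) in LETTER_CATEGORIES:
            # Unicode names of Latin letters start with "LATIN",
            # e.g. "LATIN SMALL LETTER E WITH ACUTE".
            if not unicodedata.name(ch, "").startswith("LATIN"):
                return True
    return False


titles = [
    "Великая отецественная война 1941-1945",
    "英汉常用物理学小词典",
    "Ying\u2010Han changyong wulixue xiaocidian \u2194\ufe0f",
]
for t in titles:
    print(has_non_latin_letter(t), t)
```

A few edge cases (fullwidth Latin letters, or the ordinal indicators ª and º) would still be flagged by this rule, so a quick visual check of the output is still worthwhile.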

Anyway, I hope that helps you advance your work.

David Cook
Software Engineer
Prosentient Systems

-----Original Message-----
Date: Mon, 7 Jun 2021 09:53:24 +0900
From: Charles Kelley <cmkelleymls at gmail.com>
To: Discussion Group Koha <koha at lists.katipo.co.nz>
Subject: [Koha] Identifying Records with Non-Roman Characters
Message-ID:
	<CAM8F7wqwMav=yuhDjPM7r8ZDZKjzHWONtY397AxEc-TWw-Mb5w at mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

Hello, all!

    Our catalog has 8766 records that lack the 880 field for non-Roman titles. Examples:

=245  10$aВеликая отецественная война 1941-1945 /$cВ. И. Чуйков
=246  11$a Velikaia otechestvennaia voina 1941-1945

=245  10$a英汉常用物理学小词典 /$cYing Li, Lian Wei
=246  11$aYing-Han changyong wulixue xiaocidian

    Of those, a significant portion are in Cyrillic, CJK, Greek, etc. Is there a way in either Koha or MarcEdit to isolate those records? The task is to turn each 245 into an 880, turn the 246 into the new 245, and then pair the resulting 880 and the new 245 into something like

=880  10$6245-45/(N$aВеликая отецественная война 1941-1945 /$cВ. И. Чуйков.
=245  10$6880-45$aVelikaia otechestvennaia voina, 1941-1945.

=880  10$6245-45/[dollar]1$a英汉常用物理学小词典 /$cYing Li, Lian Wei.
=245  10$6880-45$aYing-Han changyong wulixue xiaocidian.

    The indicators are easy to fix after the 245 and 880 fields are reconstructed. The 245 $c subfield can be copied from the 880 $c subfield.
I'm just trying to automate the process as much as possible and thereby avoid the painstaking manual corrections if they can be avoided.

    This is an interim step. Eventually the catalog will have to be recataloged from scratch, I think.

    Many thanks, all!

-- 

    気を付けて。 /ki wo tukete/ = Take care.

    -- Charles.

    Charles Kelley, MLS