[Koha] Has NFC vs. NFD encoding changed for Unicode in koha?
Jesse Savage
jessava at gmail.com
Sun Sep 24 20:41:16 NZDT 2023
Katrin,
Thanks very much for your reply! I apologize for not replying earlier; I
had some difficulty setting up a sandbox (Koha in a VirtualBox Debian VM
on my Windows 11 laptop).
I have since discovered that NFD is apparently limited ONLY to the OPAC
"Save record>MARC (Unicode/UTF-8)" options (i.e., both this submenu entry
and the one ending in "..8-Standard" that produces ".marcstd" files). NFC is
preserved by both the in-Koha editor (where the diacritic character can be
removed from either the left [Del] or the right [Bksp] of it with one
keystroke) and the intranet "Cataloging>Export catalog data" function.
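
For checking which form a given string is in outside of Koha, Python's
standard unicodedata module can report it directly. The following is only a
minimal sketch and is not part of Koha: the helper name describe_form and
the sample strings are made up for illustration, and
unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata


def describe_form(text: str) -> str:
    """Report whether text is in NFC, NFD, both (e.g. plain ASCII), or neither."""
    is_nfc = unicodedata.is_normalized("NFC", text)
    is_nfd = unicodedata.is_normalized("NFD", text)
    if is_nfc and is_nfd:
        return "both"
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "mixed"


# Hypothetical sample data: the same word in composed and decomposed form.
composed = "f\u00e9minins"      # e-acute as a single code point (U+00E9)
decomposed = "fe\u0301minins"   # plain e followed by combining acute (U+0301)

print(describe_form(composed))    # NFC
print(describe_form(decomposed))  # NFD

# The decomposed string is one code point longer, which is why deleting the
# accented letter takes two keystrokes instead of one.
print(len(composed), len(decomposed))  # 8 9

# Normalizing the decomposed string to NFC yields the composed one.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

The length difference is the same effect as the one-versus-two-keystroke
test in the editor.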
However, I also discovered something perhaps related. A Hungarian item that
is currently entered into the catalog with the un-diacritic-ed word
"ferfiak" can be retrieved by the proper form "férfiak", and vice versa for
the word "századi" in another title, which is retrieved by "szazadi". (This
holds true even for the rarer character " ő " [in an edition statement, " Első
kiadás"]). However, the French word (in a subject heading) "féminins"
cannot be retrieved by "femenins" (and the " é " is definitely in NFC form
within the database(s)).
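
One general technique by which a search index can make "ferfiak" match
"férfiak" is diacritic folding: decompose both the indexed text and the query
to NFD, then drop the combining marks. Whether Koha's search engine is
configured to do this for any particular index is an assumption not confirmed
here; the sketch below only illustrates the technique, and fold is a
hypothetical helper name.

```python
import unicodedata


def fold(text: str) -> str:
    """Strip diacritics and case so accented and plain spellings compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have a non-zero canonical combining class; drop them.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Escapes are used so the sample strings are unambiguous about their form.
print(fold("f\u00e9rfiak") == fold("ferfiak"))    # True
print(fold("sz\u00e1zadi") == fold("szazadi"))    # True
print(fold("Els\u0151 kiad\u00e1s"))              # elso kiadas
print(fold("f\u00e9minins"))                      # feminins
print(fold("f\u00e9minins") == fold("femenins"))  # False
```

Under this kind of folding, "féminins" reduces to "feminins". The search
string as quoted above, "femenins", differs from that by one letter, so it
would not match under this technique regardless of normalization form.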
Curiouser and curiouser, as Carroll's Alice said. I work mainly in batches
for cataloging enhancements of records exported by NFC-retaining methods. I
have not infrequently used the OPAC "Save record..." export strategy to
grab an extra (related) record to add to a batch, but don't remember it
previously NFD-ing the characters--but I can't identify/locate a specific
previously-downloaded single record that had the proper characters already
in it. (I usually have to supply them myself.) Maybe this has to do with
different indexing mechanisms for subject headings vis-à-vis titles/edition
statements?
Perhaps some developer might comment?
Thanks again!
Jesse
On Sat, Sep 9, 2023 at 9:35 AM Katrin Fischer <katrin.fischer.83 at web.de>
wrote:
> Hi Jesse,
>
> first: of course you can join koha-devel and you are welcome there :)
>
> I am not aware of any conscious/intentional change in Koha and I know
> that we've always been importing and exporting records with combined
> diacritics (NFC) into Koha to avoid display and editing issues (German
> umlauts etc.)
>
> When you edit the records in Koha, do they present with NFC or NFD? (you
> can tell by removing a diacritic, does it require one or two steps?)
>
> How were the records added to Koha?
>
> Can you replicate the behaviour on another installation for example on a
> sandbox?
>
> Hope this helps,
>
> Katrin
>
> On 03.09.23 05:11, Jesse Savage wrote:
> > This is probably a question for a developer, but I'm not one, so don't
> > really want to join the koha-devel list (and they probably wouldn't want
> > me to), and I know a lot of developers frequent this list. My apologies to
> > anyone who might not be interested in this question.
> >
> > I recently got a new Windows laptop on which I installed (I think) a
> > newer version of WSL/Ubuntu, and I see that records exported with the "MARC
> > (Unicode/UTF-8)" option (as *utf8 files, which look to be basically *mrc)
> > apparently use NFD (or "decomposed") encodings for Unicode characters (in
> > my case, mainly Spanish and French titles) and thus don't display
> > properly in *less*, *grep*, or *cat* (the diacritic follows the standard-Latin
> > character, rather than integrated with it) as do files encoded with NFC
> > characters, and the characters also can't be *grep*'d with searches like
> > "grep $'\u16A0'." The program I use to update records for uploading also
> > outputs NFD, even if the records it takes for input contain NFC. (I'm
> > waiting on the answer to the present questions to see whether that
> > program has an option hidden somewhere to output NFC instead, which I'd prefer.)
> >
> > So, is this a systematic change in koha? *Should I ensure that *.mrc
> > files for batch uploading ALWAYS include only NFD characters, or do the
> > underlying processes standardize NFC vs. NFD?* The system on which the
> > catalog lives (but which I don't administer) currently has koha
> > 23.05.00.000 and runs on "SMP Debian 5.10.179-1 (2023-05-12) x86_64".
> > Please feel free to respond directly if you feel the answer won't
> > interest anybody else.
> >
> > Thanks very much in advance!
> > Jesse
> > ---------------------
> > Jesse Savage
> > (pronouns he, him, his)
> > jessava at gmail.com
> > _______________________________________________
> >
> > Koha mailing list http://koha-community.org
> > Koha at lists.katipo.co.nz
> > Unsubscribe: https://lists.katipo.co.nz/mailman/listinfo/koha
--
---------------------
Jesse Savage
(pronouns he, him, his)
jessava at gmail.com