[Koha] Has NFC vs. NFD encoding changed for Unicode in Koha?

Jesse Savage jessava at gmail.com
Sun Sep 3 15:11:09 NZST 2023


This is probably a question for a developer, but I'm not one, so I don't
really want to join the koha-devel list (and they probably wouldn't want me
to), and I know a lot of developers frequent this list. My apologies to
anyone who isn't interested in this question.

I recently got a new Windows laptop on which I installed (I think) a newer
version of WSL/Ubuntu. I now see that records exported with the "MARC
(Unicode/UTF-8)" option (as *.utf8 files, which look to be basically *.mrc)
apparently use NFD ("decomposed") encodings for Unicode characters (in my
case, mainly in Spanish and French titles). They don't display properly in
*less*, *grep*, or *cat*: the diacritic follows the base Latin character
rather than being integrated with it, as it is in files encoded with NFC
characters. The characters also can't be *grep*'d with searches like
"grep $'\u16A0'". The program I use to update records for uploading also
outputs NFD, even if the records it takes as input contain NFC. (I'm
waiting on the answer to the present questions to see whether that program
has an option hidden somewhere to output NFC instead, which I'd prefer.)
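
For anyone unfamiliar with the distinction, here is a minimal Python sketch
of what I mean by NFC vs. NFD. It uses only the standard unicodedata module;
the sample word is just an illustration, not taken from any real record.

```python
import unicodedata

# Illustrative sample: "título" written with a precomposed i-acute (U+00ED).
sample = "t\u00edtulo"

nfc = unicodedata.normalize("NFC", sample)  # i-acute stays one code point
nfd = unicodedata.normalize("NFD", sample)  # becomes "i" + combining acute (U+0301)

# The two strings look the same when rendered correctly but are not equal.
print(nfc == nfd)                      # False
print(len(nfc), len(nfd))              # 6 7
print(len(nfc.encode("utf-8")),
      len(nfd.encode("utf-8")))        # 7 8

# A search for the precomposed character does not match the decomposed form.
print("\u00ed" in nfc, "\u00ed" in nfd)  # True False

# Converting back to NFC restores the precomposed character.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```

This would explain why a search for the precomposed character finds nothing
in an NFD file: the bytes on disk really are different.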

So, is this a systematic change in Koha? *Should I ensure that *.mrc files
for batch uploading ALWAYS include only NFD characters, or do the
underlying processes normalize everything to one form (NFC or NFD)?*  The
system on which the catalog lives (but which I don't administer) currently
has Koha 23.05.00.000 and runs on "SMP Debian 5.10.179-1 (2023-05-12) x86_64".
Please feel free to respond directly if you feel the answer won't interest
anybody else.
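
In case it helps anyone reproduce what I'm seeing, a rough sketch of how one
might check which form a given export uses. The function name and the file
name in the comment are hypothetical, and it assumes the file is valid UTF-8.
It only inspects the data: the two forms differ in byte length, and MARC
records store byte lengths in the leader and directory, so rewriting the
text in place would not be safe.

```python
import unicodedata


def describe_normalization(data: bytes) -> str:
    """Report whether UTF-8 data is in NFC, NFD, neither, or unaffected."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8
    is_nfc = unicodedata.normalize("NFC", text) == text
    is_nfd = unicodedata.normalize("NFD", text) == text
    if is_nfc and is_nfd:
        return "unaffected"  # e.g. plain ASCII, where both forms are identical
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "mixed"  # contains both precomposed and decomposed sequences


# Hypothetical usage on an exported file:
# with open("export.utf8", "rb") as fh:
#     print(describe_normalization(fh.read()))
```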

Thanks very much in advance!
Jesse
---------------------
Jesse Savage
(pronouns he, him, his)
jessava at gmail.com

