Unicode Normalization Forms: When ö ≠ ö (opencore.ch)
153 points by ocrb on Dec 31, 2021 | 140 comments


Reading about Unicode has made me much, much more circumspect about the meaning of != in languages, and about what fall-through behavior should look like. Unicode domain names lasted for a hot minute until someone registered microsoft.com with Cyrillic letters.

Years ago I read a rant by someone who insisted that being able to mix arbitrary languages into a single String object makes sense for linguists, but for most of us we would be better off being able to assert that a piece of text was German, or Sanskrit, not a jumble of both. It's been living rent free in my head for almost two decades and I can't agree with it, nor can I laugh it off.

It might have been better if the 'code pages' idea was refined instead of eliminated (that is, the string uses one or more code pages, not the process). I don't know what the right answer is, but I know Every X is a Y almost always gets us into trouble.


> but for most of us we would be better off

That's simple - it is provably wrong. While relatively uncommon, there are plenty of examples that would contradict this statement. And it's not about being able to encode the Rosetta Stone - non-scientists mix languages all the time, from Carmina Burana to Blinkenlights. They even make meaningful portmanteau words and write them with characters from multiple unrelated writing systems, like "заshitано" (see - Latin and Cyrillic scripts in the same single word!)


You miss the point. The basic unit of ASCII v2 (aka 'Unicode') should have been the codepage, not the codepoint. Having a stateful stream of codepage-symbol pairs is not a problem - in practice, all Unicode encodings ended up being stateful anyways, except in a shitty way that doesn't help to encode any semantic information.


I'm intrigued, what's заshitано?


A portmanteau of Russian "засчитано" ("credited", "taken into account", "check!") and English "shit".

The word is a joke and there is no well-defined meaning. I've seen it used as both "it counts but it's shitty" and as a way to give credit for a failure.

Surely that's not the best example, but it's a word I've remembered.


Aside from all of the other issues mentioned, for some languages it's not clear what language something is purely based on codepoints.

For languages that have Latin-derived writing systems, it's not uncommon to use English letters (without diacritics) to write the language -- how would that be handled? In addition, thanks to Han unification (though this would've been a problem regardless -- loads of characters would've been unified anyway), all similar CJK Hanzi/Kanji/漢字 characters are mapped to the same codepoint regardless of language. This means that for some sentences it is entirely possible for you to not know whether a sentence fragment is Chinese or Japanese without more context and a native-like understanding of the language.

Also in many languages English words are written verbatim meaning that you can have sentences like (my Japanese is not perfect, this is just an example):

> あの芸人のYouTube動画を見たの?面白すぎるww

And (at least in Japanese) there are loads of other uses of Latin characters aside from loan words that would be too unwieldy to write in katakana -- "w" is like "lol", and a fair few acronyms (BGM = Background Music, CM = Advertisement, TKG = 卵かけご飯 = (Raw) Egg on Rice). There are other languages that have similar "issues", but unfortunately I can't give any more examples because the only other language I speak (Serbian) writes everything (even people's names) phonetically.

As an aside -- if anyone ever has to support CJK languages (in subtitles for instance), please make sure you use the right fonts. While Unicode has encoded Han characters with the same codepoint, in different languages the characters are often drawn differently and you need to use the right font for the corresponding language (and area -- Mandarin speakers in different areas write some characters differently -- 返 is different in every CJK locale). Many media players and websites do not handle this correctly and it is fairly frustrating -- the net result is that Japanese is often displayed using Chinese fonts which makes it uncomfortable to read (it's still obvious what the character is, it's just off-putting).


Yeah, the same reason that CJK designers often use absolutely abhorrent fonts for English words on packaging and printed media is the same reason Westerners use terrible fonts for CJK. They are working with a language and culture they have no knowledge of, and think that if the characters look sorta right, then they must be readable and look good.

I've made so many horrible localization errors in the past because I had to translate things into 30 different languages and I can only barely read a handful of them, so I just copy and paste whatever the translators give me.

Incidentally, I saw Japanese text recently that was quoting both English AND Arabic in the same sentence. And this was in a block of vertical text. That is literally a worst-case scenario, I think. You have RTL-vertical text also containing RTL- and LTR-horizontal text. And unlike English, where when placed into vertical Japanese text you can essentially choose whether the characters run vertically or sideways, you can't do that with Arabic, as the letters must be joined together -- I don't believe you can break them down and stack them on top of each other.

Why can't we all just convert to Esperanto? ;)


> Incidentally, I saw Japanese text recently that was quoting both English AND Arabic in the same sentence. And this was in a block of vertical text. That is literally a worst-case scenario, I think.

Look up Mongolian script and you might change your mind :)


Oh Lord. I just looked it up on Wikipedia. It's a vertical-only script for a start. How do you even discuss a vertical-only script in a language which is written horizontally? Wikipedia has to write all the script horizontally, which would be like a page about English stacking all the letters vertically - it's weird.

From Wikipedia: "Computer operating systems have been slow to adopt support for the Mongolian script, and almost all have incomplete support or other text rendering difficulties."


In our Jenkins system, we have remote build nodes return data back to the primary node via environment variable-style formatted files (e.g. FOO=bar), so when I had to send back a bunch of arbitrary multi-line textual data, I decided to base64 encode it. Simple enough.

On *nix systems, I ran this through the base64 command; the data was UTF8, which meant that in practice it was ASCII (because we didn't have any special characters in our commit messages).

On Windows systems... oh god. The system treats all text as UTF-16 with whatever byte order, and it took me ages to figure out how to get it to convert the data to UTF-8 before encoding it. Eventually it started working, and it worked for a while until it didn't for whatever reason. I ended up tearing out all the code and just encoding the UTF-16 in base64 and then processing that into UTF-8 on the master where I had access to much saner tools.
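
A hedged Python sketch of the decode-on-the-master approach described above, with hypothetical file names:

    import base64

    # Read the base64 payload the Windows node produced; the 'utf-16'
    # codec honours a BOM and assumes little-endian otherwise -- exactly
    # the "whatever byte order" problem described above.
    with open('node_output.b64', 'rb') as f:
        text = base64.b64decode(f.read()).decode('utf-16')

    # Re-encode as UTF-8 for the saner tools on the primary node.
    with open('node_output.txt', 'w', encoding='utf-8') as f:
        f.write(text)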

Generally speaking, "Unicode" works great in most cases, but when you're dealing with systems with weird or unusual encoding habits, like Windows using UTF-16 or MySQL's "utf8" being limited to three bytes per unicode character instead of four, everything goes out the window and it's the wild west all over again.


We had to ETL .csv data that must have originated in SQL Server.

The utf-16 fact about Windows was apparently unknown to my predecessor.

Who wrote some nasty C-language binary to copy the data, knock the upper byte off of each character, and save the now-ASCII text to a new file for the MySQL load.

The encoding='utf-16' argument was all that was needed.
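
In other words, the whole C binary collapsed into something like this (file names hypothetical):

    # Python transcodes on the fly: UTF-16 in, UTF-8 out.
    with open('export.csv', encoding='utf-16') as src, \
         open('load.csv', 'w', encoding='utf-8') as dst:
        dst.write(src.read())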

For want of a nail. . .


I've had to fix this before. A co-worker working with data from a 3rd party supplier had gone "Oh this input data is mangled with stray zero bytes, I'll fix that" and of course that destroys any non-ASCII inputs, eventually I'm told that sometimes the import fails, I investigate, and I realise the "mangled" input is just UTF-16 encoded, conditionally remove the "strip zero bytes" hack and tell the decoder it's UTF-16 and it just works correctly.

The "maybe strip null bytes" code lived for years "just in case" after I fixed that because people couldn't believe that's all that was ever "wrong" with the data.


The perils of valuing backwards compatibility above all else… imagine having to use UTF16 in this day and age. Happy 2022!


UTF-16 is a simple encoding. It should take a few dozens of LoC to convert to UTF-8. At least if you don’t need extreme performance with AVX, etc.
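
A hedged sketch of that claim in Python: a hand-rolled UTF-16LE to UTF-8 converter, with no error handling for unpaired surrogates.

    def utf16le_to_utf8(data: bytes) -> bytes:
        units = [int.from_bytes(data[i:i+2], 'little')
                 for i in range(0, len(data), 2)]
        out, i = bytearray(), 0
        while i < len(units):
            u = units[i]; i += 1
            if 0xD800 <= u <= 0xDBFF:       # high surrogate: consume its pair
                u = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
                i += 1
            if u < 0x80:                    # 1-byte sequence
                out.append(u)
            elif u < 0x800:                 # 2-byte sequence
                out += bytes([0xC0 | u >> 6, 0x80 | u & 0x3F])
            elif u < 0x10000:               # 3-byte sequence
                out += bytes([0xE0 | u >> 12,
                              0x80 | u >> 6 & 0x3F, 0x80 | u & 0x3F])
            else:                           # 4-byte sequence
                out += bytes([0xF0 | u >> 18, 0x80 | u >> 12 & 0x3F,
                              0x80 | u >> 6 & 0x3F, 0x80 | u & 0x3F])
        return bytes(out)

    assert utf16le_to_utf8('ö€𝄞'.encode('utf-16-le')) == 'ö€𝄞'.encode('utf-8')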


My experience is exactly like what you say. Until it isn't. UTF16 seems like such a neat idea until it meets reality.

Mostly because of weird interactions of different libraries, languages and operating systems.


> UTF-16 is a simple encoding

This isn't true, for starters it's two encodings UTF-16LE and UTF-16BE.
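
For illustration (Python; the third line shows the BOM-prefixed form you get on a little-endian machine):

    >>> 'ö'.encode('utf-16-le')
    b'\xf6\x00'
    >>> 'ö'.encode('utf-16-be')
    b'\x00\xf6'
    >>> 'ö'.encode('utf-16')   # BOM + native byte order
    b'\xff\xfe\xf6\x00'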


Funnily enough, base64 suffers from a related issue that the likes of base58 correct: l and I, or O and 0, looking similar or even identical depending on the font!


Why does that matter? When would a human need to read and comprehend base64 encoded data?


Base58 is used for example for Bitcoin addresses. Being able to type an address from one system to another is a nice property, and it's much less error-prone if you don't have to worry about look-alike characters.
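
Concretely, the Bitcoin-style base58 alphabet simply drops the four troublemakers (a quick Python check):

    >>> import string
    >>> base64_chars = string.ascii_uppercase + string.ascii_lowercase + string.digits + '+/'
    >>> base58_chars = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'
    >>> sorted(set('0OIl') & set(base64_chars))   # look-alikes in base64...
    ['0', 'I', 'O', 'l']
    >>> sorted(set('0OIl') & set(base58_chars))   # ...absent from base58
    []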


You can already map Unicode ranges to "code pages" of sorts, so how would that help?

Thing is, people who are not linguists do want to mix languages. It's very common in some cultures to intersperse the native language with English. But even if not, if the language in question uses a non-Latin alphabet, there are often bits and pieces of data that have to be written down in Latin. So that "most of us" perspective is really "most of us in US and Western Europe", at best.

For domains and such, what I think is really needed is a new definition of string equality that boils down to "are people likely to consider these two the same?". So that would e.g. treat similarly-shaped Latin/Greek/Cyrillic letters the same.


Oh, you can do far more than "code pages of sorts". Unicode has a variety of metadata available about each codepoint. The things that are "code pages of sorts" are maybe "block" (for ö, "Latin-1 Supplement") and "plane" (for ö, "Basic Multilingual Plane"), but those are really mostly administrative and probably not what you want.

But you also have "Script" (for ö, "Latin"). Some characters belong to more than one script, though. Unicode will tell you that.

Unicode also has a variety of algorithms already written. One of the most relevant ones here is... normalization. To compare two strings in the broadest semantic sense of "are people likely to consider these the same", you want a "compatibility" normalization: NFKC or NFKD. They will for instance make `1` and `¹`[superscript] the same, which is definitely one kind of "consider these the same" -- very useful for, say, a search index.
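
A quick Python demonstration of that, using the stdlib unicodedata module:

    >>> import unicodedata
    >>> unicodedata.normalize('NFKC', '¹') == '1'   # compatibility form folds them
    True
    >>> unicodedata.normalize('NFC', '¹') == '1'    # canonical form keeps them distinct
    False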

That won't be iron-clad, but it will be better than trying to roll your own algorithm involving looking at character metadata yourself! But it won't get you past intentional attacks using "look-alike" characters that are actually different semantically but look similar/indistinguishable depending on font. The trick is that "consider these the same", it turns out, really depends on context and purpose; it's not always the same.

Unicode also has a variety of useful guides as part of the standard, including the guide to normalization https://unicode.org/reports/tr15/ and some guides related to security (such as https://unicode.org/reports/tr36/ and http://unicode.org/reports/tr39/), all of which are relevant to this concern, and suggest approaches and algorithms.

Unicode has a LOT of very clever stuff in it to handle the inherently complicated problem of dealing with the entire universe of global languages. It pays to spend some time with them.


You don't need a new definition, you just need to follow the official Unicode security guidelines.

I recommend the Moderately Restrictive profile from TR39 for mixed-script identifiers, plus allowing Greek with Latin. This way you can identify Cyrillic, Greek, CJK or any recommended script, but are not allowed to mix Cyrillic with Greek. And you can still write math with Greek symbols.

What we don't have is a standard string library to compare or find strings. wcscmp does not do normalization. There is no wcsfc (foldcase) for case insensitivity. There's no wcsnorm or wcsnfc. I'm maintaining such a library.

coreutils, diff, grep, patch, sed and friends all cannot find Unicode strings; they have no string support. They can only mimic filesystems, finding binary garbage. Strings are something different from pure ASCII or binary garbage: strings have an encoding, and are Unicode.

Filesystems are even worse because they need to treat filenames as identifiers, but do not. Nobody cares about TR31, TR39, TR36 and so on.

Here is an overview of the sad state of Unicode unsafeties in programming languages: https://github.com/rurban/libu8ident/blob/master/c11.md


Yeah, Greek alphabet is used a lot in sciences. It's really annoying that we're only starting to get proper support now. (Including on keyboards : http://norme-azerty.fr/en/ )


By far the most common use case these days is when a URL has to be mentioned in an otherwise non-Latin text.


I am trying to formalise this with Cosmopolitan Identifiers (https://obua.com/publications/cosmo-id/3/). These identifiers consist of words and symbols. Symbols are normalised based on how they look, and so Latin / Cyrillic / Greek symbols that look alike are mapped to the same symbol. Words are normalised differently, so that "Tree" and "tree" map to the same normal form. As a symbol, "T" and "t" are obviously different. I am not totally happy with the concept yet; I have implemented a fourth, simpler iteration of it as a Typescript package: https://www.npmjs.com/package/cosmo-id .

One of the problems is, how do you distinguish symbols and words? A simple way to do this is to classify something as a symbol if it is just a single character, and as a word otherwise. For example, "α-β" would consist of two symbols, separated by a hyphen, but "αβ" is a word and normalised to "av" based on some convention on how to "latinise" Greek words.


Sprinkling English with foreign words is really, really common. I'm in New Zealand and people do it all the time. And even in the states, right? Don't want two different strings because someone writes an English sentence about how much they love jalapeño.


Think of just something simple like writing an immigrant's name inside a sentence. It's kinda funny that people in SV, full of immigrants, never seem to think of putting their own or a coworker's name in a String.


I'm not a linguist and that will probably be readily apparent. The word jalapeño leaves me wondering how distinct a boundary a language can possess or how one can sort out which language an individual word belongs to outside the context of the rest of the text or speech.

In English, jalapeño is correctly spelled with or without the eñe (and AFAIK the letter doesn't have a name in English; you have to use the Spanish name). So, there's an English word that doesn't use the letters assigned to the English alphabet. How do we place the word? Well, obviously English borrowed the word from Spanish, so it's a Spanish word. Well, no, it's only the Spanish adjectivization of Nahuatl words used to name the place called Xalapa...


Words like angst or ersatz are English words borrowed from German. The German words are written identically (except capitalisation), but the meaning of the English word is much more specific than the German "original". Meanwhile the word "Blitz" has completely distinct meanings in English and German. In English it's a sudden concerted effort, in German it's lightning. Despite the English word originating from German, they don't share a meaning at all.


Isn't the etymology of 'blitz' in English due more to 'blitzkrieg'?


German has the word blitzschnell, which means really fast (as fast as a bolt of lightning). So in a way the English meaning of blitz still fits your description.


My grandfather's thesis was auf Deutsch and is sprinkled with French and Latin words.


It’s very common in the online messaging I’ve seen in English, Spanish, Chinese, and Russian.


What's a word? (A quick test - how many words were in the previous sentence, maybe 3 or 4 depending on whether the 's is part of a word; so can we talk about Jóhannesson's foreign policy?).

It's hard enough to know what a letter is in unicode. Breaking things into words is just another massive headache.


> Years ago I read a rant by someone who insisted that being able to mix arbitrary languages into a single String object makes sense for linguists, but for most of us we would be better off being able to assert that a piece of text was German, or Sanskrit, not a jumble of both.

Presumably the person who wrote it speaks a single language.

Just because something is not useful to them, it doesn't mean it is not useful in general. There are millions of polyglots as well as documents that include words and names in multiple scripts.


I think in that case the idea would either be that you should then have an array of strings, each of which may have its own language set, or that the string should be labelled as "containing Latin and Cyrillic", but still not able to include arbitrary other characters from Unicode. And multi-lingual text still generally breaks on words... Kilobytes of Latin text with a single Cyrillic character in the middle of a word is very suspicious, in a way that kilobytes of Latin text with a single Cyrillic word isn't.

Of course you'd always need an "unrestricted" string (to speak to the rest of the system if necessary), but there are very few natural strings out there in the world that consist of half-a-dozen languages just mishmashed together. Those exceptions can be treated as exceptions.


What would that look like to an end-user? Do I need to tell my browser which scripts are contained in my emails? What would happen if I start typing in a different script?

> there are very few natural strings out there in the world that consist of half-a-dozen languages just mishmashed together

...in your experience. What is an almost-absurd exception to you, is every day life to others.


> Presumably the person who wrote it speaks a single language.

Presumably the person who wrote it speaks English.


Of course living in denial makes it easy to ignore harsh realities. Unfortunately for them, humans don't work that way. Things aren't gonna spontaneously change just to make their life easier. The software adapts to us, not the other way around. People complain about the complexities of dates and times but they still make every effort to get it right because it matters.

If a programming language allows text processing but can't even properly compare unicode text, it is buggy and needs to be fixed. If an operating system can't deal with unicode, it's buggy and needs to be fixed.


Reminds me of the good old days with EUC-KR, KSC 5601, and all those different encoding schemes I've successfully repressed in my memory for years. Yes, you could probably assert that a string was either Korean or English but never anything else... because the system was incapable of representing it.

I'm not exactly sure how a code page is supposed to help us here. Developers have trouble supporting multiple languages when they're all in the Unicode Standard. Supporting code pages for languages they've never heard of? Not a chance.


I'd guess a standardized codepage marker, like a "start of CP[932]" or "start of CP[1252]" at each codepage switch, is going to be necessary, but it might be just that - a necessity. Han unification is a well-known problem to Far Eastern users, but the Unicode normalization problem is basically the same kind of thing.


Heh, funny, I'm implementing this exact thing at the moment, oddly enough -- rather, implementing a security check that provides that same guarantee you mention, Mixed Script protections.

In Unicode spec terms, 'UTS 39 (Security)' contains the description of how to do this, mostly in section 5, and it relies on 'UAX 24 (Scripts)'.

It's more nuanced than your example but only slightly. If you replace "German" with "Japanese" you're talking about multiple scripts in the same 'writing system', but the spec provides files with the lists of 'sets of scripts' each character belongs to.

The way that the spec tells us to ensure that the word 'microsoft' isn't made up of fishy characters is that we just keep the intersection of each character's augmented script sets. If at the end, that intersection is empty, that's often fishy -- ie, there's no intersection between '{Latin}, {Cyrillic}'.

However, the spec allows the legit uses of writing systems that use more than one script; the lookup procedure outlined in the spec could give script sets like '{Jpan, Kore, Hani, Hanb}, {Jpan, Kana}' for two characters, and that intersection isn't empty; it'd give us the answer "Okay, this word is contained within the Japanese writing system".
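
A hedged Python sketch of that intersection check, with a toy script table standing in for data derived from the UCD's Scripts.txt / ScriptExtensions.txt:

    # Toy augmented-script table; real code derives these per-character
    # sets from Scripts.txt and ScriptExtensions.txt (UAX 24).
    SCRIPTS = {
        **{c: {'Latn'} for c in 'microsoft'},
        '\u0456': {'Cyrl'},                          # і, a Latin-i look-alike
        '\u30ab': {'Jpan', 'Kana'},                  # カ KATAKANA KA
        '\u4e2d': {'Jpan', 'Kore', 'Hani', 'Hanb'},  # 中 Han ideograph
    }

    def resolved_script_set(word: str) -> set:
        # UTS 39 section 5.1: intersect every character's script set;
        # an empty result marks a suspicious mixed-script word.
        sets = [SCRIPTS.get(ch, set()) for ch in word]
        return set.intersection(*sets) if sets else set()

    print(resolved_script_set('microsoft'))        # {'Latn'} -- fine
    print(resolved_script_set('m\u0456crosoft'))   # set() -- fishy
    print(resolved_script_set('\u30ab\u4e2d'))     # {'Jpan'} -- legit Japanese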


Where I work and communicate, mixing 2, 3, and sometimes 4 writing systems is pretty normal; I have 3 keyboard layouts on my phone (Latin that covers English and occasional Spanish, Cyrillic, and Japanese).

In any case, there are emoji which are expected to be a part of text.

On one hand, it would be great to separate areas of different encodings inside a string. But character codes are already such separators.

Two things need to go though: the assumption of linear-time index-based access to characters in a string, and the custom to compare strings as byte arrays.

The first is already gone from several advanced string implementations. The second is harder: e.g. Linux filesystems support Unicode by being encoding-agnostic and handling names as byte sequences. Reworking that would be hard, if practical at all.


> and the custom to compare strings as byte arrays

I think more generally, the idea that a language std lib can provide string equality for human language strings is just silly. String equality is an extremely context-dependent, fuzzy operation, and should be handled by each context differently. For example, for Unicode hostname to certificate mapping, hostname equality should be handled by rendering the hostname in several common web fonts and checking if the resulting bitmaps are similar. If they are, then assigning different certificates to these equal hostnames should not be done.

Of course, in other contexts, there are different rules. For example, if looking up song names, the strings "jalapeno" and "jalapeño" should be considered equal, in English text at least.
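
For the search-style contexts, a hedged sketch of one such context-specific rule in Python: accent-insensitive matching by decomposing to NFD and dropping combining marks.

    import unicodedata

    def fold_accents(s: str) -> str:
        # Decompose, then drop combining marks (Mn = Mark, nonspacing).
        return ''.join(ch for ch in unicodedata.normalize('NFD', s)
                       if unicodedata.category(ch) != 'Mn')

    assert fold_accents('jalapeño') == fold_accents('jalapeno')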


That doesn’t make sense to me. Even disregarding cases where people mix languages (how do you write a dictionary? If the answer is “just create a data structure combining multiple strings”, shouldn’t we standardize how to do that?), all languages share thousands of symbols such as currency symbols, mathematical symbols, Greek and Hebrew alphabets (to be used in math books written in the language), etc. So, even languages such as Greek and English share way more symbols than that they have unique ones.


> being able to mix arbitrary languages into a single String object

Unless I missed something, that is impossible with Unicode. Mixing multiple languages would require a way to specify the language used for case conversion, sorting and font rendering settings mid-string, and I don't think that Unicode has that. For example, try to write a program that correctly uppercases a single string containing both an English i and a Turkish i in your favorite Unicode-supporting language; the code point is the same for both, and you generally only get to specify one language globally or per function call.
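
A quick Python illustration of the problem: str.upper() is locale-independent, so the Turkish reading is simply unreachable for a plain string.

    >>> 'i'.upper()   # fine for English; Turkish wants 'İ' (U+0130)
    'I'
    >>> 'I'.lower()   # fine for English; Turkish wants 'ı' (U+0131)
    'i'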


You can write a string with words from multiple languages, you just can't easily modify it with operations like case conversion. But sorting shouldn't depend on the origin language anyway, it depends on the language of the reader. All words in an English dictionary are sorted in "English" order


Displaying is also questionable. If you want Japanese/Chinese/etc. rendered correctly in your browser you have to mark the corresponding sections with a language tag; they have different rules on how several abstract graphemes shared between them should be rendered (amount and shape of strokes).


I mix languages all the time.


> It might have been better if the 'code pages' idea was refined instead of eliminated

Obviously yes, it would have been better.

But Unicode was designed by the same people who designed ASCII - monolingual Americans who never had to deal on a daily basis with anything that doesn't fit into the 26 letters of the English alphabet. So here we are.


> But Unicode was designed by the same people who designed ASCII - monolingual Americans who never had to deal on a daily basis with anything that doesn't fit into the 26 letters of the English alphabet.

This is not even remotely true.


Yes it is. The people who made Unicode went out of their way to make life hard and offensive for CJK, Cyrillic, etc., users.


Since you could watch this all unfold in real time (and can go back and read mailing list archives back ab initio) you can easily see the participants and their positions and arguments. The “people who made Unicode” came from and continue to come from all over the world.

Also, do you remember what (polyglot computing) life was like before Unicode?


A major problem with Unicode is that it gives you strange ideas about text: that you can somehow take human text encoded in Unicode and answer questions like "how many letters does this have" or "are these two pieces of text different" or "split this text into words" in a way that works generically for any language or context.

These are all myths, and APIs for such things are bugs. The only thing you can meaningfully do with two pieces of arbitrary Unicode text is to say if they are byte-by-byte equal. For any other operation, you need to have specific business logic.

For example, are "Ionuț" and "Ionut" and "Ionutz" the same string or different strings? There is no generic answer: depending on the intended business logic, they may be identical or not (e.g. if we consider these to be Romanian names, they should be considered identical for search purposes, but probably not identical for storage purposes, where you want to remember exactly how the person spelled their name).

A related problem is that most languages have no separate types for Text/String on one hand, and Symbol on the other. Text or other Strings should be opaque human text that can only be interpreted by specific code, offering almost no API (only code point iteration). Symbols should be a restricted subset of Unicode that can offer fuller features, such as lengths, equality, separation into words etc. This would be the type of JSON key names used in serialization and deserialization, for example.


What they should have done is not that strange. Text is merely an ordered collection of characters. If you just assigned each character (aka grapheme) a number, text becomes a sequence of numbers. The first two questions you pose, "how many letters does this have" and "are these two pieces of text different", are trivially answered by such a representation. Unicode's fuck up is they managed to come up with something that cannot reliably answer those two questions.

In fact what Unicode has ended up with is so horrible, it's a major exercise in coding just to answer a simple question like "is there an 'o' in this sentence" -- as in, Python3's "'o' in sentence" does not always return the right result.

Unicode's starting point was all wrong. There is an encoding that did a perfectly good job of mapping graphemes to numbers: ISO-10646. In fact Unicode is based on it, but then committed their original sin: they decided all the proposed ISO-10646 encodings (i.e., how the numbers are encoded into byte streams) were crap, so they released a standard that combined two concepts that should have remained orthogonal: codepoints, and encoding those codepoints to a binary stream.

Now it's true ISO-10646's proposed encodings were undercooked. That became painfully apparent when Ken Thompson came up with utf-8. But no biggie, right: utf-8 was just another ISO-10646 encoding, just let it take over naturally. The Unicode solution to the encoding problem was to first decide we would never need more than 2^16 codepoints, then wrap it up in "one true encoding everyone can use": UCS2. Windows and Java, among others, bought the concept, and have paid the price ever since.

They were wrong of course. 2^16 was not enough. So they replaced the UCS2 encoding with UTF-16, which was sort of backwards compatible. But not one UTF-16, oh no, that would be too simple. We got UTF-16LE and UTF-16BE. Notice what has happened here: take identical pieces of text, encode them as valid Unicode, and end up with two binary objects that were different. Way to go boys!

But that wasn't the worst of it: they managed to screw up UTF-16 so badly it didn't expand the code space to 2^32 points, just 2^20. And in case you can't guess what happens next, I'll tell you: turns out there are more than 2^20 graphemes out there.

What to do? Well there are a lot of characters that are "minor variants" of each other, like o and ö. Now Unicode already had a single code point for ö, but to make it all fit and be uniform they decided "Combining Diaeresis" was the way these things should be done in future. So now the correct way to represent ö is the code point for o followed by a combining code point that says "add an umlaut to the preceding character". But as the original codepoint for ö still exists, we can have two identical graphemes that don't compare as equal under Unicode, which is how we get ö ≠ ö.

So it's not only Python3's "'o' in sentence" that doesn't always work. We have arrived at the point where "'ö' in sentence" can't be done without some heavy lifting that must be done by a library. Just to make it plain: some CPUs can do "'o' in sentence" in a single instruction. That simple design decision has cost us orders of magnitude in CPU efficiency.
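
For the record, the Python behaviour being alluded to (a hedged demo; the substring operator compares code points, not graphemes):

    >>> import unicodedata
    >>> s = unicodedata.normalize('NFD', 'Göteborg')   # decomposed ö
    >>> 'ö' in s   # the precomposed U+00F6 is not in the decomposed string
    False
    >>> 'o' in s   # but a bare 'o' matches inside the decomposed ö
    True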

I know these are strong words, but IMO this is a brain dead, monumental fuckup, making acre-feet and furlong-fortnights look positively sane. It's time to abandon Unicode, and its "Combining Diaeresis" in particular, and go back to basics: ISO-10646 and utf-8. UTF-8 originally provided a 31 bit encoding space, which is more than enough to realise the single guiding principle that ISO-10646 was founded on: one codepoint per grapheme.

It won’t happen of course, so as a programmer I’ll have to deal with the shit sandwich the Unicode consortium has served up for the rest of my life.


While one codepoint per grapheme would be nice, it still wouldn't solve text. There are also problems like RTL and LTR writing systems that need to be combined into the same text.

And, many of the examples I gave earlier will not go away. The problem of similar URLs using different characters would be smaller, but not gone - microsoft.com and mícrosoft.com still look too similar. Text search should still support alternate spellings (color and colour). People's names would still have multiple legally identical spellings.


A fun related issue that could occur: applying NFD to a string can make it longer, so a sanitiser that limits file names to 255 UTF-16 code units but doesn’t first normalise to NFD could fail on HFS+.

This could occur on systems that normalise to NFC as well: NFC lengthens some strings, e.g. 𝅗𝅥 (U+1D15E MUSICAL SYMBOL HALF NOTE) normalises to 𝅗𝅥 (U+1D157 MUSICAL SYMBOL VOID NOTEHEAD, U+1D165 MUSICAL SYMBOL COMBINING STEM) in both NFC and NFD (similar happens in various Indic scripts, pointed Hebrew, and the isolated case of U+2ADC FORKING which is special for reasons UAX #15 explains), but I don’t think there are any file systems that actually normalise to NFC? (APFS prefers NFC, but doesn’t normalise at the file system level.)

The remaining concern would be that NFC could take more UTF-8 code units than NFD despite adding a character, but in practice this doesn’t occur (checked on NormalizationTest-3.2.0.txt).
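
A hedged sketch of the length effect in Python:

    >>> from unicodedata import normalize
    >>> [len(normalize(f, 'ö')) for f in ('NFC', 'NFD')]
    [1, 2]
    >>> len(normalize('NFC', '\U0001D15E'))   # U+1D15E grows even under NFC
    2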


APFS doesn’t “prefer” anything - it will not change the bytes passed to it to NFC or NFD. The bytes passed for creation are stored as-is (HFS will store the NFD form on disk if you pass the NFC form to it). However, APFS is normalization-insensitive (if you create an NFC name on disk, you won’t be able to create the NFD version, and you will be able to access the name by both the NFC and NFD variants), just as HFS is - they both use different mechanisms to achieve normalization insensitivity.


That's a really great point about the string-length and not often addressed. You might even be able to force some sort of buffer overflow with that I guess.


Why isn’t the answer just “Don’t unicode normalise the file name”?

I thought the generally recommended way to deal with file names is to treat as a block of bytes (to the extent that e.g. rust has an entirely separate string type for OS provided strings), or just to allow direct encoding/decoding but not normalisation or alteration.


Well, precisely because if you don't normalize the filenames, ö ≠ ö. You could have two files with different filenames, `göteborg.txt` and `göteborg.txt`, and they are different files with different filenames.

Or you could have one file `göteborg.txt`, and when you try to ask for it as `göteborg.txt`, the system tells you "no file by that name".
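
A hedged illustration of the two spellings in Python:

    >>> a = 'g\u00f6teborg.txt'    # precomposed ö (NFC)
    >>> b = 'go\u0308teborg.txt'   # o + COMBINING DIAERESIS (NFD)
    >>> a == b                     # render identically, compare unequal
    False
    >>> import unicodedata
    >>> unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)
    True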

Unicode normalization is the solution to this. And the unicode normalization algorithms are pretty good. The bug in this case is that the system did not apply unicode normalization consistently. It required a non-default config option to be turned on to do so? I don't really understand what's going on here, but it sounds like a bug in the system to me that this would be a non-default config option.

Dealing with the entire universe of human language is inherently complicated. But unicode gives us some actually pretty marvelous tools for doing it consistently and reasonably. But you still have to use them, and use them right, and with all software bugs are possible.

But I don't think you get fewer crazy edge cases by not normalizing at all. (In some cases you can even get security concerns, think about usernames and the risk of `jöhn` and `jöhn` being two different users...). I know that this is the choice some traditional/legacy OSs/file systems make, in order to keep pre-unicode-hegemony backwards compat. It has problems as well. I think the right choice for any greenfield possibilities is consistent unicode normalization, so `göteborg.txt` and `göteborg.txt` can't be two different files with two different filenames.

[btw I tried to actually use the two common different forms of ö in this text; I don't believe HN normalizes them so they should remain.]


It looks like instead of the config option switching everything to use the same normalization it keeps a second copy of the name in a database to compare to. What a horrible kludge, I wonder how they even got into this situation of using different normalization in different parts of the system?


That seems an odd choice indeed, because even if you do have different normalizations in different parts of the system, you don't need to keep multiple copies -- you just need to apply the right normalization in the right place. All of the unicode normalization algorithms are both idempotent and of course completely deterministic. If you apply NFD to any legal input, you get the same thing every time -- there's no need to store the NFC version separately to compare it to NFC input when all you have otherwise is NFD; you can just normalize the input to NFD to compare it to what you have!

Unless it was meant to be for performance?


In terms of what filenames are, neither Windows nor Linux (I don't know for sure with macOS, but I doubt it) actually guarantees you any sort of characters.

Linux filenames are a sequence of non-zero bytes (they might be ASCII, or at least UTF-8, they might be an old 8-bit charset, but they also might just be arbitrary non-zero bytes) and Windows file names are a sequence of non-zero 16-bit unsigned integers, which you could think of as UTF-16 code units but they don't promise to encode UTF-16.

Probably the files have human readable names, but, maybe not. If you're accepting command line file names it's not crazy to insist on human readable (thus, Unicode) names, but if you process arbitrary input files you didn't create, particularly files you just found by looking around on disks unsupervised - you need to accept that utter gibberish is inevitable sooner or later and you must cope with that successfully.

Rust's OSStr variants match this reality.
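
Python mirrors this reality too, for what it's worth: hand its os APIs bytes and they stay bytes, with no decoding imposed (output hypothetical):

    >>> import os
    >>> os.listdir('.')    # str in, str out (decoded using the fs encoding)
    ['göteborg.txt']
    >>> os.listdir(b'.')   # bytes in, bytes out: arbitrary non-zero bytes survive
    [b'g\xc3\xb6teborg.txt']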


This is what I found quite refreshing about Rust — instead of choosing one of the following:

  A) The programmer is an almighty god who knows everything, we just expose him to the raw thing
  
  B) The programmer is an immature toddler who cannot be trusted, so we handle things for them
What Rust does is more along the lines of "you might already know this, but anyways here is a reminder that you, the programmer, need to make a decision about this".


macOS is interesting: some APIs normalize filenames while others don't. And it causes some very interesting bugs.

One example: when you submit a file in Safari, it doesn't normalize the file name, while JS's file.name does.


Filenames in HFS+ filesystem (an old filesystem used by Mac OS X) are normalized with a proprietary variant of NFD - this is a filesystem feature. APFS removed this feature.


By “proprietary variant” you mean “publicly documented variant” which IIRC is just the normalization tables frozen in time from an early version of Unicode (the idea being that updating your OS shouldn’t change the rules about what filenames are valid).

As for APFS, it ~~doesn’t~~didn’t normalize, but I believe it still requires UTF-8. And the OS will normalize filenames at a higher level. EDIT: they added native normalization. At least for iOS; I didn’t dig enough to check if macOS is doing native normalizing or is just normalization-insensitive.


Normalisation is expressly done with the composition of version 3.1 for compatibility: see <https://www.unicode.org/reports/tr15/#Versioning>. If that’s what HFS+ does, then “proprietary variant” is wrong. And if not, I’m curious what it does differently.

(On the use of version 3.1, note that in practice version 3.2 is used, correcting one typo: see <https://www.unicode.org/versions/corrigendum3.html>.)

I find a few references to it being slightly different, but not one of them actually says what’s different; Wikipedia is the only one with a citation (<https://en.wikipedia.org/wiki/HFS_Plus>: “and normalized to a form very nearly the same as Unicode Normalization Form D (NFD)[12]”), and that citation says it’s UAX #15 NFD, no deviations. One library that handles HFS+ differently switches to UCD 3.2.0 for HFS+ <https://github.com/ksze/filename-sanitizer/blob/e990e963dc5b...>, but my impression from UAX #15 is that this should be superfluous, not actually changing anything. (Why is UCD 3.2.0 still around there? Probably because IDNA 2003 needs it: <https://bugs.python.org/issue42157#msg379674>.)

Update: https://developer.apple.com/library/archive/technotes/tn/tn1... has actual technical information, but the table in question doesn’t show Unicode version changes like they claim it does, so I dunno. Looks like maybe from macOS 10.3 it’s exactly UAX #15, but 8.1–10.2 was a precursor? I’m fuzzy on where the normalisation actually happens, anyway.


The `filename-sanitizer` library you have linked has the following comment.

                # FIXME: improve HFS+ handling, because it does not use the standard NFD. It's
                # close, but it's not exactly the same thing.
                'hfs+': (255, 'characters', 'utf-16', 'NFD'),
I wonder what that means...


The technote linked by the parent has a note saying

> The characters with codes in the range u+2000 through u+2FFF are punctuation, symbols, dingbats, arrows, box drawing, etc. The u+24xx block, for example, has single characters for things like "(a)". The characters in this range are not fully decomposed; they are left unchanged in HFS Plus strings. This allows strings in Mac OS encodings to be converted to Unicode and back without loss of information. This is not unnatural since a user would not necessarily expect a dingbat "(a)" to be equivalent to the three character sequence "(", "a", ")" in a file name.

> The characters in the range u+F900 through u+FAFF are CJK compatibility ideographs, and are not decomposed in HFS Plus strings.

The bit about the u+24xx block is misleading; the decompositions of the characters I looked at there (such as ⒜) are compatibility decompositions, not canonical ones. However the CJK compatibility ideographs are a working example. U+F902 (車) decomposes to U+8ECA (車) regardless of normalization form, but the technote says these must not be decomposed.


ZFS can support normalization also:

    # $'\xc3\xb6' is UTF-8 for precomposed ö (NFC);
    # $'\x6f\xcc\x88' is "o" + COMBINING DIAERESIS (NFD).
    $ echo test > $'\xc3\xb6'
    $ cat $'\x6f\xcc\x88'
    cat: ö: No such file or directory

    # With formD normalization enabled, both spellings name the same file:
    $ zfs create -o normalization=formD pool/dataset
    $ echo test > $'\xc3\xb6'
    $ cat $'\x6f\xcc\x88'
    test


>APFS removed this feature.

And then brought it back. It normalizes now.


Sure, but at some point you might want to create a file (frequently using user input) or filter files using some user-provided query string -- the kind of use cases that unicode normalization was invented for. So the whole "opaque blob of bytes" filesystem handling is nice if all you want is to not silently corrupt files, but it is very obviously not even covering 10% of normal use cases. Rust isn't being super smart, it just has its hands thrown up in the air.


The most common desktop file systems are case-insensitive, which complicates the picture.


Still, it looks like the right thing to do is let the filesystem do the filesystem's job. The filesystem should be normalizing unicode and enforcing the case-insensitivity and whatnot, but just the filesystem. Wrappers around it like whatever Nextcloud is doing should be treating the filenames as a dumb pile of bytes.


I'm not sure this problem even has a "right" solution.

> Wrappers around it like whatever Nextcloud is doing should be treating the filenames as a dumb pile of bytes.

What do you do when the input isn't a dumb pile of bytes, but actual text? (Like from a text box the user typed into?)


Maintain a table that maps the original file name to random-generated one that doesn't hit these gotchas.


I'm afraid I don't follow. Who maintains this table and who consumes it? What if they're different entities? How do you prevent it from going out of sync with the file system when the user renames a file? Are you inventing your own file system here? How do you deal with existing file systems?


I assumed that you have a system where file management/synchronization happens strictly through a web interface, and files are not changed or renamed outside this system's knowledge. Under these preconditions, having such a mapping table frees the users from having to abide by whatever restrictions the underlying file system places on valid file names.


Oh I was talking about the general case from a programming standpoint. What do you do on a typical local filesystem?

The point I'm trying to get at being, you need to worry about the representation at multiple layers, not just at the bottom FS layer.


And place the files in chunks, and... Wait I think we're getting close to reinventing block storage again ;)


Case insensitivity is a braindead behavior. If desired it should be a fallback path selecting the best match, not the first resort.


The opposite; case insensitivity is what human brains do: we read word WORD Word and woRD as the same thing, and it's computer case-sensitive matching which is "brainless". Computers not aligning with what humans do is annoying and frustrating; they should be tools for us, not us for them. There's no way two people would write ö ö and have readers think they were different because one was written in oil-based ink and one in water-based ink, or whatever corresponds to behind-the-scenes implementation details like combining form vs. single character.

I have just been arguing the same thing in far too much detail in this thread: https://news.ycombinator.com/item?id=29722019


Case insensitivity and "what human brains do" becomes incredibly complicated outside of English. There are also many other things which human brains recognize as the same thing but would be unreasonable to implement in filesystems.

In Japanese, くるま, クルマ, and 車 are all the same word (the first two are the phonetic spelling "kuruma", the latter is the Chinese character). However, in order to know that 車 is read くるま you need to be a native Japanese speaker (or have a dictionary) -- should filesystems have dictionaries to match what a human would think? Search engines that support Japanese have to handle this to some degree, but I humbly suggest that implementing Google Search's language handling code into a filesystem would be an ill-advised decision.

If you wanted to implement the most minimal version of this you would map between katakana and hiragana, but that means you'll need to do this for other languages. For instance, Serbian. Serbian uses two scripts (both of which have upper and lower case forms) and any native Serbian speaker would see "tuđa ljuta paprika" and "туђа љута паприка" as the same text (note that lj became љ). Should that also be automatically translated in the filesystem?

In German, capitalisation is not reversible. ß becomes SS when capitalised but will be lowercased as ss. (There is now a capital version -- ẞ -- but from what I gather it's not widely used.)

Even in English you have British and American spellings of a given word -- native speakers would recognise them as the same thing, but it would not be reasonable to expect a filesystem to map them to the same thing. Initialisms can have multiple representations of the same thing (N.S.A. vs NSA). And you also have cases where capitalisation actually does distinguish words (May vs may, PRISM vs prism, CAT vs cat, etc). What about fullwidth and halfwidth Latin characters (Ｈｅｌｌｏ vs Hello)? Arguably those are even more identical than upper and lower case.

For all of the above reasons, case insensitivity is something which most systems will only ever implement for English and a few other European languages, meaning that it's more of a wart than a fully-working feature. If the argument really is "well, a human would recognise these two names as the same thing, so the filesystem should too" then why are none of the other examples given above handled? If it's too difficult to do correctly (which is my view) then why support any of this in the first place? However, everything should be normalised (NFC or NFD depending on your usecase).


There are a couple arguments against case-insensitive filesystems I think are strong. The first is simply compatibility with existing case-sensitive systems. The second is that case is locale-dependent, so a pair of names could be equivalent or not depending on the device's locale.

I don't think I've seen any good argument against normalization, though.


> word WORD Word and woRD as the same thing

I don't know about anyone else, but I read WORD as someone yelling, Word as designating/specifying a "word" with some importance, and woRD as the mocking Spongebob meme. I absolutely don't read "case insensitive" and I don't think filesystems should either.


You read DOG as someone yelling ‘dog’, not as a different word to ‘dog’. And Dog as a significant dog, not a significant something else.

Imagine if you could only search for ‘dog’ if you had to specify whether the author yelled it or not before you could find it.


It sounds like you're saying that cases should matter in some ways but not in others, which I take no issue with.


Case can have information in it, like color and underline and boldface and italics can carry information. I think it would be clever if Google let me colour my search text and then only found text which was rendered in the same colour, but terrible if colouring my search text was mandatory and it then only found pages with text in the same colour. Likewise terrible if your code editor searched only for code with syntax highlighting matching the colours you typed in the search box.

Dog in bold, italics, red, green, uppercase, lowercase, initialcaps, smallcaps, are all the same word. What "the same" means has fuzzy boundaries and sometimes needs very precise specification, but I personally want the default to be the fuzzy convenient and the hyper-literal to be available as a fallback.

[I notice that I used 'color' and 'colour' here. My native language is UK English and programming languages and much of the internet use US English. I'm not sure if I would want `vim colour.txt` to open `color.txt`. Probably not. PowerShell 7 has a suggestions feature for "you typed a command which wasn't found, here are the most similar command names:" - mentioned in https://github.com/PowerShell/PowerShell/issues/10546 ]


> Dog in bold, italics, red, green, uppercase, lowercase, initialcaps, smallcaps, are all the same word.

But that is a feature of the word Dog. Try the German words maßen (limited amounts) and massen (large amounts): historically they share the same upper case rendering, MASSEN. Now someone versed in German could change to the alternative MASZEN or use the rather modern upper case version of ß. However, the default naive (and most of the time correct) conversion between cases loses a significant amount of information.
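
Python's default case mapping shows the loss (a hedged demo):

    >>> 'maßen'.upper()    # the traditional all-caps form
    'MASSEN'
    >>> 'MASSEN'.lower()   # ...and the ß is gone on the way back down
    'massen'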


Have to agree. It's also usually only about 10 lines of code to support both insensitive and sensitive searching for those who can't read English that way.


You've done the same thing here as your other comment. Here suggesting that people "can't read English" and in your other comment suggesting that people "can't get their head around capslock and don't deserve support".

What about people who CAN read English that way, but think having to match case when searching or referencing text hinders more than it helps?


Then the search tool should support not being case sensitive. I understand the efficiency of case insensitive search (otherwise Google's empire wouldn't exist). But having it enforced as the source of all truth is just broken.


> The opposite; case insensitivity is what human brains do, we read word WORD Word and woRD as the same thing, it's computer case-sensitive matching which is "brainless".

Only if different case-variants do not have meaning. When two words that differ only in case have different meaning, we distinguish them (e.g. "moon" and "Moon").


Assuming you mean the distinction between moons in general and the Earth's Moon (Luna), if I wrote "neil armstrong was the first human to walk on the moon" would you think I meant anything other than Neil Armstrong walking on The Moon?

Meaning doesn't go when the case changes in anything like the way meaning goes when the letters change. "neil armstrong was the first human to walk on the roof" is a very different sentence; you can't get anything like that difference with just case changes. If I spoke it, you wouldn't be able to tell if I spoke the correct case or not. Would you want school children searching for "one small step for man, one giant leap for mankind" and Google saying "no results" because they used a lowercase m in mankind? Would you want a TV quiz show asking "What is Europa?" and a contestant answering "a moon of jupiter" and the host asking "do you mean moon with a capital m or lowercase m?" before they decide whether the answer is correct?


WORD, Word WoRD....

Sorry to say I tend to use case sensitivity as a filter for me offering support to other developers. I'm not willing to find time for people who can't get their head around "turn on/off caps lock". You don't do it in professional writeups or applications (and I hope not in a CV) so don't pollute my filesystems or codebases with that madness.


I’m not talking about caps lock. I can get my head around case sensitivity, I can use it, it’s worse, I don’t want to have to use it anymore than I want to use filesystem permissions in octal even though I can. Having tools take chmod u+r is easier and doesn’t change the filesystem at all.


Sorry, not sure I see the point here other than computers provide human representation of binary data?

If the mapping is non-trivial then unless you're careful you end up breaking basic consistency between input and stored data, hence the weird issues with mangling the unicode chars. If the mapping is trivial there's almost nothing to discuss. If the mapping is many-to-many you're going to have a bad time unless you're consistent with your use of the maps. Then the fun is broken mappings where you get data loss due to incorrect many-to-one and one-to-many mappings.

There are times when caps matter, i.e. code and filesystems are human-readable so should not be arbitrary, but searching these, for instance, makes sense to be insensitive when needed (perhaps even by default).


So you’re fine with ~/Downloads and ~/downloads coexisting as entirely separate directories? And [email protected] and [email protected] being attributed to two different people ;)


First one: yes, though good UI should prevent it from happening unless the user really intended it (for example I have ~/Documents symlinked into Dropbox, so ~/documents could be local-only documents)

Second one: no, emails are not filenames, and more generally distinguishability is more important for identifiers. In cases where identifiers like emails need to be mapped to filenames, like caches, they should be normalized.


> So you’re fine with ~/Downloads and ~/downloads coexisting as entirely separate directories?

Case (in)sensitivity for filenames is a non-issue in my experience. Never had problems with either convention. As for emails, I do think insensitivity was the right choice.


The RFC states that email addresses are case sensitive.

The local-part of a mailbox MUST BE treated as case sensitive.

Section 2.4 RFC 2821, https://www.ietf.org/rfc/rfc2821.txt


Ah interesting. I guess the case insensitivity (for incoming email) is a decision of the popular services then, like Gmail's decision to consider johndoe equivalent to john.doe.


My guess would be that the local part of an email address would usually map to a directory on case sensitive filesystems...


can we just say no to capital letters? (or lowercase?)

do capital letters have a good enough usage case to justify their continued existence?


You are free to stop using capital letters, but good luck getting everyone to go along. Capitals have been around for centuries (they’re older than the printing press) and aren’t going anywhere.


The lower-case letters in Greek/Latin/Cyrillic are the new additions, initially we only had what is now called upper-case.


Fun fact: The Apple Ⅱ and Ⅱ+ originally only did upper-case, and it was very popular to add a Shift Key / lower-case mod via one of the gamepad buttons: https://web.archive.org/web/20010212094858/http://home.swbel...


That works for programmers, but not for users. There could be several files with the same name, but with different encodings. Worse, depending on how your terminal encodes user input, some of them might not be typable.


From the user's perspective I don't want any normalisation at all. It's good as long as you only have one file system, but as soon as you get multiple file systems with conflicting rules (which includes transferring files to other people) it becomes hell. Unfortunately we are stuck with that hell.


Falls over on the fact that I don't want to be able to write these two files in the same dir. If I write file ö1.txt and ö1.txt, then I want to be warned that the file exists, even if the encoding is different when I use two different apps but try to write the same file.

The same applies for a.txt and A.txt on case insensitive file systems (as someone pointed out the most common desktop file systems are).


Java is terrible in this regard, as most file APIs use "java.lang.String" to identify the filename, whose interpretation most of the time depends on the system property "file.encoding". The result is that there are files you can never read from a Java application if the on-disk filename encoding does not match Java's file.encoding.
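
For what it's worth, Python 3 had the same class of bug until PEP 383; here's a hedged sketch of the bytes-level escape hatch that String-based file APIs can't offer:

    import os

    # Listing with a bytes path returns raw bytes, so filenames that
    # don't decode in the current locale encoding stay reachable:
    for raw in os.listdir(b"."):
        print(repr(os.fsdecode(raw)))   # surrogateescape round-trips back to bytes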


Duolingo doesn't handle Unicode normalisation for certain languages, and it's incredibly frustrating. Here's one example[0] (Vietnamese) and I know it's the case for Yiddish as well.

[0]: https://forum.duolingo.com/comment/17787660/Bug-Correct-Viet...


A filesystem accepting only NFD should be filed as a bug. It can normalize internally to NFD, as Apple's previous HFS+ did.

But even worse than that is Python's NFKC, which normalizes ℌ to H and so on. The recommended normalizations are NFC for offline normalization (as in compiled languages and databases) and NFD for online use, where speed trumps space. unicode.org talking that much about NFKC was a big mistake: NFKC is crazy and doesn't even round-trip. The whole TR31 XID_Start/Continue sets exist mostly because of NFKC issues, not so much for stability. But people bought it for its stability argument.
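
For example (Python):

    import unicodedata

    print(unicodedata.normalize("NFKC", "\u210c"))   # ℌ -> 'H'
    print(unicodedata.normalize("NFKC", "\u2460"))   # ① -> '1'
    # No round trip: nothing maps 'H' back to ℌ, the distinction is gone.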

I'm just writing a library and linter for such issues: https://github.com/rurban/libu8ident

Also note that C++23 will most likely enforce NFC-only identifiers. Same problem as with this filesystem. My implementation was to accept all normalization forms and store them internally and in the object files as NFC. The C ABI should declare this too. Currently they care as little as Linux filesystems do: nada. Identifiers being unidentifiable.


> But here, normalization caused this issue.

Nope, the lack of normalization by the SMB server on both counts caused the issue. It could have normalized before emitting, but it definitely should have normalized on receiving, for comparison.
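
Roughly, the comparison side only needs something like this (a Python sketch; names_equal is a made-up helper for illustration):

    import unicodedata

    def names_equal(a: str, b: str) -> bool:
        # Compare filenames by canonical equivalence, not by code points.
        return (unicodedata.normalize("NFC", a)
                == unicodedata.normalize("NFC", b))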


I think that in the ls->read workflow, Nextcloud shouldn't normalize the response from SMB and should issue back to SMB whatever SMB returned to Nextcloud.


According to Unicode, it should be allowed to, and the SMB server should be able to handle it. That's kind of the point of normalization: it's meant to be done before all comparisons so that exactly this doesn't happen. Your suggestion is just premature optimization, i.e. eliminating a redundancy.


Unicode doesn't say anything about what "should be allowed to" with respect to an unrelated protocol. If the protocol says that filenames are sequences of 16-bit values that have to be compared one by one, then that's what it is.


It does say that if comparisons are being made then... and comparisons are being made, so yes, it does.


If comparisons are being made of Unicode strings, sure. Does the protocol actually define the identifier in question as a Unicode string, though? Or as an array of 16-bit ints?


At least it should perform validation and reject the NFD form, forcing the client to normalize to NFC?
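
Python 3.8+ can at least detect that cheaply; a sketch, with the reject-vs-normalize choice left as server policy:

    import unicodedata

    def validate_filename(name: str) -> str:
        if not unicodedata.is_normalized("NFC", name):
            raise ValueError("filename must be NFC-normalized")
        return name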


NFD is fine when you don't have much time and can afford the space. NFC is about 3x slower but smaller. Forcing clients never works: be tolerant in what you accept and strict in what you emit. In the case of C++23 enforcing NFC, my mind is twisted. It would allow heavy tokenizer optimizations, but this is an offline compiler, where you don't really need that.

The problem is the compatibility variants, NFKC and NFKD. But then you have use cases where you need them, and more, to actually find strings. Levenshtein should not be the default when searching for strings.


As a northern European, I kinda miss iso-8859-1 being used everywhere back in the mid 90s.


As a Middle-Eastern I still dread those times


Eventually someone will write a string matching library that renders the characters on an internal canvas and diffs the pixels instead.


When Unicode adopted normalization it went off the rails into mudville. Then, determined to make a mockery of its purpose, it adopted semantic meanings, fonts, and then started inventing all sorts of new characters.

"If you vote for my nutburger glyph my kid drew for a kindergarten assignment, I'll vote for the chicken scratching you noticed in the barnyard dust."


Half normal isn't normal. That said, I personally try to avoid unicode in filenames (and caps too) for similar reasons.


And then some jerk comes along and writes ꙮ one time, in one document, in the whole of human history, just for the lulz, and the next thing you know we're saying hello to a new log4j.

https://en.wikipedia.org/wiki/Multiocular_O


Most formats (including XML) require data to be normalized to NFC.


Can you point me to a single format that actually requires NFC? Most things either make no comment or just express preferences, though I’m confident there will be some somewhere.

XML does not require normalisation: per <https://www.w3.org/TR/xml11/#sec-normalization-checking>, XML data SHOULD be fully normalised, but MUST NOT be transformed by processors; in other words, it’s a dead letter “SHOULD”, and no one actually cares, just like almost everything else.


It seems like a bug that to get consistent Unicode normalization you need to flip a non-default config option. What am I missing?


Reminds me of the old pаypal.com scam, where а != a.
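
Normalization doesn't help with that one, by the way: Cyrillic а (U+0430) has no decomposition, so it survives every form (Python):

    import unicodedata

    latin, cyrillic = "a", "\u0430"
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, cyrillic) != latin
    print(unicodedata.name(cyrillic))   # CYRILLIC SMALL LETTER A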


Well, if 7-bit US ASCII was good enough for our Lord, it is good enough for me ;-)


Well.. if we're getting technical, the "Old" Testament is written in Hebrew, and the "New" Testament in written in Greek.

The first line of Genesis reads thus: (from right-to-left, although the earliest Hebrew is actually LTR) בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ.

And the beginning of the Gospel of Mark thus: Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ υἱοῦ θεοῦ.

(and if we're getting super technical there are a bunch of Aramaic phrases in the Bible that Jesus spoke, although I know little of Aramaic and I don't know how it would have been written in Biblical times between the Greek characters)

So the Lord would be needing those 16 bits after all...


tl;dr - don't use crazy Unicode characters in filenames; they can be problematic for non-trivial reasons (in this case, because of Unicode normalization on an SMB mount).


What's "crazy" about the letter? It's a standard letter of several European alphabets.


Nothing crazy about the "letter", but it is crazy that there are multiple different ways to encode the "letter".


A user wouldn't know that there are multiple ways to encode a given character unless they're experienced with Unicode.

Additionally there are (iirc) multiple ways to encode characters even in the ASCII set.

This is purely a failure to apply normalization consistently.


So, no combining characters? OK, even if you rule out Latin characters with accents and only use code points that treat a character-plus-accent as a single entity... you still have world languages which need combining characters in order to work, which means you can't really escape multiple encodings of the same "graphemes" (these languages don't exactly have "letters" the way ASCII does).



