
Here's the thing: I don't want to work in UTF-8. I want to work in Unicode. Big difference. Because tracking the encoding of my strings would increase complexity. So at the earliest convenience, I validate my assumptions about encoding and let a lower layer handle it from then on.

I understand you're arguing about some sort of equivalency between byte-arrays and Unicode strings. Sure, there are half-baked ways to do word-splitting on a byte-array. But why do you consider that a viable option? Under what circumstances would you do that?



Every circumstance. Why do you consider it unviable? What problems do you think having a Unicode sequence solves?


Convince me. Here's a little library function that turns text into a set of words:

    import re, unicodedata

    def keywords(text):
        return set(filter(None, re.split(r"\W+", unicodedata.normalize("NFKC", text).lower())))
How would this look if strings were byte-arrays? How would `normalize()`, `lower()`, and `split()` know what encoding to use?

The way I see it: if the encoding is implicit, you have global state; if it's explicit, you have to pass the encoding around. Either way, that's extra state to worry about. When the passed value is a Unicode string, the question never comes up.
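
For the sake of argument, here's roughly what the explicit-encoding version would have to look like (the function name and the `encoding` parameter are hypothetical, purely to illustrate):

    import re, unicodedata

    def keywords_bytes(data, encoding):
        # The caller has to tell us how the bytes are encoded ...
        text = data.decode(encoding)
        # ... and we still end up in Unicode to normalize and lowercase.
        return set(filter(None, re.split(r"\W+", unicodedata.normalize("NFKC", text).lower())))

Now every caller has to carry the encoding around alongside the bytes.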


It looks pretty much the same, except that you assume the input is already in your library's canonical encoding (probably utf-8 nowadays).
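
Sketching it in Python terms (purely illustrative; the decode line is just where the assume-utf-8 contract would live):

    import re, unicodedata

    def keywords(data):
        # data is assumed to already be utf-8 bytes; no encoding gets passed around
        text = data.decode("utf-8")
        return set(filter(None, re.split(r"\W+", unicodedata.normalize("NFKC", text).lower())))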

I realize this sounds like a total cop-out, but when the use-case is destructively best-effort tokenizing an input string using library functions, it doesn't really matter whether your internal encoding is utf-32 or utf-8. I mean, under the hood, normalize still has to map arbitrary-length sequences to arbitrary-length sequences even when working with utf-32 (see: unicodedata.normalize("NFKC", "a\u0301 \ufb03") == "\xe1 ffi").
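
To make the length change concrete (a quick REPL-style check, nothing more):

    import unicodedata

    s = "a\u0301 \ufb03"                    # 4 code points: a, combining acute, space, ffi ligature
    t = unicodedata.normalize("NFKC", s)    # "\xe1 ffi"
    print(len(s), len(t))                   # 4 5 -- lengths differ even with one code point per unit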

So on the happy path, you don't see much of a difference.

The main observable difference is that if you take input without decoding it explicitly, then the always-decode approach has already crashed long before reaching this function, while the assume-the-encoding approach probably spouts gibberish at this point. And sure, there are plenty of plausible scenarios where you'd rather get the crash than subtly broken behaviour. But ... I don't see this reasonably being one of them, considering that you're apparently okay with discarding all \W+.
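
A contrived illustration of the two failure modes (the shift_jis bytes are just an example I picked, not anything from your code):

    data = "日本語".encode("shift_jis")           # bytes that are not valid utf-8
    try:
        data.decode("utf-8")                      # the always-decode approach crashes right here
    except UnicodeDecodeError as e:
        print("decode failed:", e)
    print(data.decode("utf-8", errors="replace"))  # assume-and-press-on: silent mojibake, a run of U+FFFD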


I agree with you. I wish Python 3 treated strings as byte sequences, mostly UTF-8, as Python 2 once did and Go does now. Then things would stay simple in Japan. Python 3 feels cumbersome: to handle raw input as a string, you must first decode it with some encoding, which is a fragile process. It would be enough to pass the input bytes through transparently, with an optional stage to convert other encodings to UTF-8 when necessary.
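
Something like this optional front-end stage is what I have in mind (a rough sketch; the cp932 fallback is just an example encoding, not a general solution):

    def to_utf8(data: bytes) -> bytes:
        # Pass valid UTF-8 through untouched; transcode only when it clearly isn't.
        try:
            data.decode("utf-8")
            return data
        except UnicodeDecodeError:
            return data.decode("cp932").encode("utf-8")  # e.g. legacy Japanese input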


I know this from PHP, where I have to be aware of which encoding my strings are in, and I still don't see what the advantage of that is supposed to be.




