Here's the thing: I don't want to work in UTF-8. I want to work in Unicode. Big difference, because tracking the encoding of my strings would increase complexity. So at the earliest convenience, I validate my assumptions about the encoding and let a lower layer handle it from then on.
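A minimal sketch of what I mean, assuming UTF-8 input and a made-up `read_text` helper (not from any particular library):

```python
import sys

def read_text(path, encoding="utf-8"):
    # Validate the encoding assumption once, at the I/O boundary.
    # Everything past this point only ever sees Unicode str objects.
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        sys.exit(f"{path} is not valid {encoding}: {exc}")
```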
I understand you're arguing about some sort of equivalence between byte-arrays and Unicode strings. Sure, there are half-baked ways to do word-splitting on a byte-array. But why do you consider that a viable option? Under what circumstances would you do that?
How would this look if strings were byte-arrays? How would `normalize()`, `lower()`, and `split()` know what encoding to use?
The way I see it: if the encoding is implicit, you have global state; if it's explicit, you have to pass the encoding around. Either way, that's extra state to worry about. When the passed value is a Unicode string, the question never comes up.
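To make the contrast concrete, here's a hypothetical pair of tokenizers; the names and signatures are mine, not from any library:

```python
import unicodedata

# With Unicode strings, the encoding question never arises:
def tokenize(text: str) -> list[str]:
    return unicodedata.normalize("NFKC", text).lower().split()

# With byte-arrays, the encoding has to travel alongside the data
# (an explicit parameter) or live somewhere global (an implicit default):
def tokenize_bytes(data: bytes, encoding: str) -> list[bytes]:
    text = data.decode(encoding)
    return [token.encode(encoding)
            for token in unicodedata.normalize("NFKC", text).lower().split()]
```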
It looks pretty much the same, except that you assume the input is already in your library's canonical encoding (probably utf-8 nowadays).
I realize this sounds like a total cop-out, but when the use-case is destructively best-effort tokenizing an input string using library functions, it doesn't really matter whether your internal encoding is utf-32 or utf-8. I mean, under the hood, normalize still has to map arbitrary-length sequences to arbitrary-length sequences even when working with utf-32 (see: `unicodedata.normalize("NFKC", "a\u0301 \ufb03") == "\xe1 ffi"`).
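Spelled out (the input is 'a' plus a combining acute accent and the U+FB03 'ﬃ' ligature; note that the length changes even when you count codepoints rather than bytes):

```python
import unicodedata

s = "a\u0301 \ufb03"             # 'a' + combining acute, space, 'ffi' ligature
n = unicodedata.normalize("NFKC", s)
print(n == "\xe1 ffi")           # True: á, space, then three separate letters
print(len(s), len(n))            # 4 5 -- four codepoints in, five out
```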
So on the happy path, you don't see much of a difference.
The main observable difference is that if you take input without decoding it explicitly, then the always-decode approach has already crashed long before reaching this function, while the assume-the-encoding approach probably spouts gibberish at this point. And sure, there are plenty of plausible scenarios where you'd rather get the crash than subtly broken behaviour. But ... I don't see this reasonably being one of them, considering that you're apparently okay with discarding all \W+.
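A contrived sketch of that difference, assuming Latin-1 bytes arriving somewhere that expects UTF-8:

```python
raw = "naïve café".encode("latin-1")    # input in the "wrong" encoding

# Always-decode: fails loudly at the boundary, long before any tokenizing.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("crashed early:", exc)

# Assume-the-encoding: the bytes flow straight through the tokenizer and
# only turn into visible gibberish once something treats them as UTF-8.
tokens = raw.lower().split()            # [b'na\xefve', b'caf\xe9']
print([t.decode("utf-8", errors="replace") for t in tokens])  # ['na�ve', 'caf�']
```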
I agree with you. I wish Python 3 treated strings as byte sequences, mostly holding UTF-8, the way Python 2 once did and Go does now. That would keep things simple in Japan.
Python 3 feels cumbersome. To handle raw input as a string, you must first decode it with some encoding, and that is a fragile process. It would be enough to pass the input bytes through transparently and add an optional stage that converts other encodings to UTF-8 when necessary.
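Something like this hypothetical helper is what I have in mind for that optional stage (the name and signature are made up):

```python
from typing import Optional

def to_utf8(data: bytes, source_encoding: Optional[str] = None) -> bytes:
    """Pass UTF-8 (or unspecified) bytes through untouched; transcode only
    when the caller names a different source encoding."""
    if source_encoding is None or source_encoding.lower() in ("utf-8", "utf8"):
        return data
    return data.decode(source_encoding).encode("utf-8")

# Shift-JIS input gets converted once; everything downstream stays bytes.
sjis = "こんにちは".encode("shift_jis")
assert to_utf8(sjis, "shift_jis") == "こんにちは".encode("utf-8")
assert to_utf8(b"already utf-8") == b"already utf-8"
```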