> The minor improvements of Python 3 did not warrant breaking backwards compatib...

naniwaduni · on Jan 24, 2021

Let's not pretend that py3's string changes weren't fundamentally wrong and didn't create years of issues trying to decode things that could properly be arbitrary byte sacks as utf-8.

So my answer is that it was a deeply misconceived change that shouldn't have been made at all, let alone been taken as the cornerstone of a "necessary" break in backward compatibility.

orf · on Jan 24, 2021

The string changes where both necessary and correct. There is a difference between bytes and strings and treating them as the same led to so many issues. Thank god I’ve not seen a UnicodeDecodeError in decades.

And the ecosystem agrees.

nerdponx · on Jan 24, 2021

what's wrong about strings representing text?

You're not making an argument about backward compatibility here, you're making a strong claim that representing text as a sequence of Unicode code points is fundamentally wrong. I have never heard anyone make this point before, and I am inclined to disagree, but I'm curious what your reasoning is for it.

naniwaduni · on Jan 24, 2021

Indeed, representing text as a sequence of Unicode code points is fundamentally wrong.

There are no operations on sequences of Unicode code points that are more correct than an analogous operation on bytes.

(Everyone's favourite example, length, actually becomes less correct—a byte array's length at least corresponds to the amount of space one might have to allocate for it in a particular encoding. A length in codepoints is absolutely meaningless both technically and linguistically. And this is, for what little it's worth, close to the only operation you can do on a string without imposing additional restrictions about its context.)

orf · on Jan 24, 2021

> There are no operations on sequences of Unicode code points that are more correct than an analogous operation on bytes.

That’s ridiculous. Uppercasing/lowercasing, slicing, “startswith”, splitting, etc etc.

Your statement is correct if you only care about ascii.

hvdijk · on Jan 24, 2021

Uppercasing/lowercasing cannot be done on Unicode code points, because that fails to handle things like ﬁ -> FI where the uppercased version does not consist of the same number of Unicode code points. Slicing and splitting cannot be done on Unicode code points because it may separate a character from a subsequent combining character. "startswith" cannot be done on Unicode code points because some distinct code points need to be treated as equivalent. These are pretty much the same problems you also have when you perform those same operations on bytes. You might encounter those problems in fewer cases when you perform operations on code points rather than on bytes, but you won't have solved the problems entirely.

naniwaduni · on Jan 24, 2021

Worse, you'll have pushed the problematic cases out of the realm of obviously wrong and not sensible to do, into subtly wrong and will break down the line in ways that will be hard to recognize and debug.

naniwaduni · on Jan 24, 2021

None of those operations are correct on Unicode codepoints. Your statement is only just barely tenable if you only care about well-edited and normalized formal prose in common Western languages.

Even then, upper/lower-casing is iffy.

lolc · on Jan 24, 2021

> There are no operations on sequences of Unicode code points that are more correct than an analogous operation on bytes.

Wow. I wonder how you arrived at this point. You can't, for example, truncate a UTF-8 byte array without the risk of producing a broken string. But this is only the start. Here are two strings, six letters each, one in NFC, the other in NFD, and their byte-length in UTF-8:

    "Åström" is 8 bytes in UTF-8

    "Åström" is 10 bytes in UTF-8

If your software tells the user that one is eight and the other is 10 letters long, it is not "less correct". It is incorrect. Further, if searching for "Åström" won't find "Åström", your software is less useful than it could be if it knew Unicode. (And it's sad how often software gets this wrong.)

tsimionescu · on Jan 24, 2021

> If your software tells the user that one is eight and the other is 10 letters long, it is not "less correct". It is incorrect.

In fact, if the software tells you that either of the strings is either 8 or 10 letters wrong, then either way the software is incorrect - those are both obviously 6 letter strings.

Now, does UTF8 help you discover they are 6 letter strings better than other representations? There are certainly text-oriented libraries that can do that, but not those that simply count the UTF8 code points - they must have an understanding of all of Unicode. Even worse, the question "how many letters does this string have" is not generally meaningful - there are plenty of perfectly valid unicode strings for which this question doesn't have a meaningful answer.

However, the question "how many unicode code points does this string have" is almost never of interest. You either care about some notion of unique glyphs, or you care about byte lengths.

lolc · on Jan 25, 2021

> then either way the software is incorrect - those are both obviously 6 letter strings.

What I wanted to get at is that in Unicode, I have a chance to count letters to some useful degree. Why should I consider starting at byte-arrays?

> there are plenty of perfectly valid Unicode strings for which this question doesn't have a meaningful answer.

I don't get it. Why does the existence of degenerate cases invalidate the usefulness of a Unicode lib? If I want to know how many letters are in a string, I can probably get a useful answer from a Unicode lib. Not for all edge-cases, but I can decide on the trade-offs. If I have a byte-array, I start at a lower level.

a1369209993 · on Jan 27, 2021

> What I wanted to get at is that in Unicode, I have a chance to count letters to some useful degree.

You do not. You merely happen to get the right answer by coincidence in some cases, same as bytes-that-probably-are(n't)-ASCII. To throw your own words back at you:

  "Åström" is 6 code points in Unicode

  "Åström" is 8 code points in Unicode

If your software tells the user that one is six and the other is 8 letters long, it is not "less correct". It is incorrect. Further, if searching for "Åström" won't find "Åström", your software is less useful than it could be if it knew text. (And it's sad how often software gets this wrong.)

naniwaduni · on Jan 24, 2021

You can't truncate a sequence of Unicode codepoints without the risk of producing a broken string, either. What do you get if you truncate "Åström" after the first "o"? What do you get if you truncate 🇨🇦 after the first codepoint?

Normalization is not a real solution unless you restrict yourself to working with well-edited formal prose in common Western languages.

This is not a claim made from ignorance.

lolc · on Jan 24, 2021

Sorry, we're mixing two layers. Of course, if I truncate a string, it may lose its meaning. And having accents fall off is problematic. But it's not the same as truncating a byte-array, because then an invalid sequence of bytes may result.

Stop treating these cases as equivalent. They're not.

naniwaduni · on Jan 24, 2021

They are equivalent. The only reason you find it problematic that a sequence of bytes is "invalid" (read: can't be decoded in your preferred encoding) is because you've manufactured the problem.

In the end, the only layer at which it really matters whether your byte sequence can be decoded is the font renderer, and just being valid utf-8 isn't good enough for it either.

lolc · on Jan 24, 2021

> In the end, the only layer at which it really matters whether your byte sequence can be decoded is the font renderer

Ok that explains how we ended up here. I'm considering some other common uses! A search-index for example greatly profits from being able to normalize representations and split words.

naniwaduni · on Jan 24, 2021

Search index use cases probably also benefit from normalizing inputs across encodings, that's no shining example of utf-8-onlyism.

You can still best-effort split words! You can do a pretty good job splitting words without ensuring that the words decode in your preferred encoding.

lolc · on Jan 25, 2021

Here's the thing: I don't want to work in UTF8. I want to work in Unicode. Big difference. Because tracking the encoding of my strings would increase complexity. So at the earliest convenience, I validate my assumptions about encoding and let a lower layer handle it from then on.

I understand you're arguing about some sort of equivalency between byte-arrays and Unicode strings. Sure there are half-baked ways to do word-splitting on a byte-array. But why do you consider that a viable option? Under what circumstances would you do that?

naniwaduni · on Jan 25, 2021

Every circumstance. Why do you consider it unviable? What problems do you think having a Unicode sequence solves?

lolc · on Jan 25, 2021

Convince me. Here's a little library function that turns text into a set of words:

    def keywords(text):
        return set(filter(None, re.split("\W+", unicodedata.normalize("NFKC", input_str).lower())))

How would this look if strings were byte-arrays? How would `normalize()`, `lower()`, and `split()` know what encoding to use?

The way I see it: If the encoding is implicit, you have global state. If it's explicit, you have to pass the encoding. Both is extra state to worry about. When the passed value is a Unicode string, this question doesn't come up.

naniwaduni · on Jan 25, 2021

It looks pretty much the some, except that you assume the input is already in your library's canonical encoding (probably utf-8 nowadays).

I realize this sounds like a total cop-out, but when the use-case is destructively best-effort tokenizing an input string using library functions, it doesn't really matter whether your internal encoding is utf-32 or utf-8. I mean, under the hood, normalize still has to map arbitrary-length sequences to arbitrary-length sequences even when working with utf-32 (see: unicodedata.normalize("NFKC", "a\u0301 ﬃ") == "\xe1 ffi").

So on the happy path, you don't see much of a difference.

The main observable difference is that if you take input without decoding it explicitly, then the always-decode approach has already crashed long before reaching this function, while the assume-the-encoding approach probably spouts gibberish at this point. And sure, there are plenty of plausible scenarios where you'd rather get the crash than subtly broken behaviour. But ... I don't see this reasonably being one of them, considering that you're apparently okay with discarding all \W+.

suzuki · on Jan 25, 2021

I agree with you. I wish Python 3 had strings as byte sequences mainly in UTF-8 as Python 2 had once and Go has now. Then things would be kept simple in Japan. Python 3 feels cumbersome. To handle a raw input as a string, you must decode it in some encoding first. It is a fragile process. It would be adequate to treat the input bytes transparently and put an optional stage to convert other encodings to UTF-8 if necessary.

lolc · on Jan 25, 2021

I know this from PHP, where I have to be aware of the encoding the strings are in. I still don't see what the advantage should be of that.

Jasper_ · on Jan 24, 2021

So in one case, the text becomes corrupted and unreadable (i.e. loses its meaning), and in the other, it becomes corrupted and unreadable. What's the difference?

Having "accents fall off" has gotten people murdered [0]. Accents aren't things peppered in for effect, they turn letters into different letters, spelling different words. Analogously, imagine that a bunch of software accidentally turned every "d" into a "c" because some committee halfway around the world decided "d" should be composed of the "c" and "|" glyphs. That's the kind of text corruption that regularly happens in other languages when dealing with text at the code point layer.

[0] https://languagelog.ldc.upenn.edu/nll/?p=73 . Note that this is Turkish, which has the "dotted i" problem, meaning that this was more than likely a .toupper() gone wrong rather than a truncation issue.

lolc · on Jan 24, 2021

The difference is that for truncating, I can work within Unicode to deal with the situation. I can accept the possibility of mutilated letters, I can convert to NFC, I can truncate on word-boundaries, I have choice.

If I have an byte-array, I can do none of these things short of implementing a good chunk of Unicode. If I truncate, I risk ending up with an invalid UTF-8 string. End of story.

tsimionescu · on Jan 24, 2021

And what is wrong with an invalid UTF-8 string? Why were you truncating the string in the first place?

Basically, I believe the point here is that a Unicode aware truncation should be done in a Unicode aware truncate method. There is no good reason to parse a string as UTF-8 ahead of time - just keep it as a blob of bytes until you need to do some something "texty" with it. It is the truncate-at-word-boundaries() method that should interpret the bytes as UTF-8 and fail if they are not valid. Why parse it sooner?

Jasper_ · on Jan 24, 2021

> If I have an byte-array, I can do none of these things short of implementing a good chunk of Unicode. If I truncate, I risk ending up with an invalid UTF-8 string.

Yes, and? You can have an invalid sequence of Unicode code points too, such as an unpaired surrogate (something Python's text model actually abuses to store "invalid Unicode" in a special, non-standard way).

If you truncate at the byte level, you are just truncating "between code points"; it's a closer granularity than at the code point layer, so you can also convert to NFC, truncate on word boundaries, etc. You just need to ignore the parts of the UTF-8 string that are invalid; which isn't difficult, because UTF-8 is self-synchronizing.

Blikkentrekker · on Jan 24, 2021

> How would you have handled the string/bytes split in a way that’s backwards compatible?

A language pragma.

All functions that return `bytes` continue to do so unless specifically opted in on a per file basis, then they return `unicode`.

`str` thus returns `bytes` as it does in 2, unless the pragma ask otherwise.

> Or the removal of old-style classes?

They would obviously not be removed and still be available but depræcated.

orf · on Jan 24, 2021

> All functions that return `bytes` continue to do so unless specifically opted in on a per file basis, then they return `unicode`.

Nothing in py2 returns bytes. They all return strings. That is the issue. What about subclasses or type wrappers? What about functions that return bytes or utf8 strings? How would you handle code that then calls “.startswith()” on a returned string/bytes value?

A language pragma that fundamentally alters a built in type across all the code you have in a program is never going to work and pushes the burden onto library authors to support a large matrix of different behaviours and types.

It would make the already ridiculous py2 str/bytes situation even more ridiculous.

> They would obviously not be removed and still be available but depræcated.

Having two almost separate object models in the same language is rather silly.

Blikkentrekker · on Jan 24, 2021

> Nothing in py2 returns bytes. They all return strings. That is the issue.

No, that is not an issue, that is semantics.

What one calls it does not change the behavior. And aside that the system could perfectly well be designed that this pragma changes that `str` is synonymous with either `bytes` or `unicode` depending on it's state.

What about subclasses or type wrappers? What about functions that return bytes or utf8 strings? How would you handle code that then calls “.startswith()” on a returned string/bytes value?

You would now which is which by using the pragma or not.

Not using the pragma defaults to the old behavior, as said, one only receives the new, breaking behavior, when one opts in.

Python could even support always opting in by a configuration file option for those that really want it and don't want to add the pragma at the top of every file.

> A language pragma that fundamentally alters a built in type across all the code you have in a program is never going to work and pushes the burden onto library authors to support a large matrix of different behaviours and types.

Opposed to the burden they already had of maintaining a 2 and 3 version?

Any new code can of course always return `unicode` rather than `str` which in this scheme is normally `bytes` but becomes `unicode` with the pragma.

It would make the already ridiculous py2 str/bytes situation even more ridiculous.

> Having two almost separate object models in the same language is rather silly.

Yes, it is, and you will find that most languages are full of such legacy things that no new code uses but are simply for legacy purposes.

“It is silly.” turns out to be a rather small price to pay to achieve. “We have not broken backwards compatibility.”.

orf · on Jan 24, 2021

I don’t really have the time of inclination to continue arguing, but I will point out that you say all this as though the approach the team took failed. It worked. The ecosystem is on py3.

You can imagine some world with a crazy context-dependent string/bytes type. Cool. In reality this would have caused endless confusion, especially with beginners and the scientific community, and likely killed the language or at the very least made the language a shadow of what it is now.

They made the right choice given the outcome. Anything else is armchair postulation that was discussed previously and outright rejected for obvious reasons.

dataflow · on Jan 24, 2021

> It worked. The ecosystem is on py3.

Because they're doing everything they can to force py2 to go away. It's not it's dying a natural death out of disuse. Exhibit A is everyone else in this post still wanting to use it.

If you think strings "work" under py3, my guess is you've never had to deal with all the edge cases, especially across all the 3 major desktop platforms. Possibly because your applications are limited in scope. (You're definitely not writing general-purpose libraries that guarantee correctness for a wide variety of usage.) Most things Python treats as Unicode text by default (file contents, file paths, command-line arguments, stdio streams, etc.) are not guaranteed to be contain only Unicode. They can have invalid Unicode mixed into them, either accidentally or intentionally, breaking programs needlessly.

As a small example, try these and compare:

  python2 -c "import sys; print('Your input was:'); print(sys.argv[1])" $'\x80' | xxd
  python3 -c "import sys; print('Your input was:'); print(sys.argv[1])" $'\x80' | xxd

This program is content-agnostic (like `cat`, `printf`, etc.), and hence, with a decent standard library implementation, you would expect it to be able to pass arbitrary data through just fine. But it doesn't, because Python insists on treating arguments as Unicode strings rather than as raw data, and it behaves worse on Python 3 than Python 2. You really have to go out of your way to make it work correctly—and the solution is often pretty much to just ditch strings in many places and deal with bytes as much as possible... i.e., you realize Unicode strings were the wrong data type. But since you're still forced to deal with them in some ways, you get the worst of both worlds and that increases the complexity dramatically and it become increasingly painful to ensure your program still works correctly as it evolves.

I say all these because I've run into these and dealt with them, and it's become clear to me that others who love Unicode strings just haven't gone very far in trying to use them. Often this seems to be because they (a) are writing limited-scope programs rather than libraries, (b) confine themselves to nice, sanitized systems & inputs, and/or (c) take an "out-of-sight -> out-of-mind" attitude towards issues that don't immediately crop up on their systems & inputs.

orf · on Jan 24, 2021

> You're definitely not writing general-purpose libraries that guarantee correctness for a wide variety of usage.

At the risk of sounding like a dick, I’m a member of the Django technical board and have been involved with its development for quite a while. Is that widely used or general purpose enough?

If you want a string then it needs to be a valid string with a known encoding (not necessarily utf8). If you want to pass through any data regardless of its contents then you use bytes. They are two very different things with very different use cases.

If I read a file as utf8 I want it to error if it contains garbage, non-text contents because the decoding failed. Any other way pushes the error down later into your system to places that assume a string contains a string but it’s actually arbitrary bytes. We did this in py2 and it was a nightmare.

I concede that it’s convenient to ignore the difference in some circumstances, but differentiating between bytes/str has a lot of advantages and makes Python code more resilient and easier to read.

dataflow · on Jan 24, 2021

> I’m a member of the Django technical board and have been involved with its development for quite a while. Is that widely used or general purpose enough?

That's not quite what I was saying here. Note I said "wide variety of usage", not "widely used". Django is a web development framework—and its purpose is very clear and specific: to build a web app. Crucially, a web framework knows what its encoding constraints are at its boundaries, and it is supposed to enforce them. For examples, HTTP headers are known to be ASCII, HTML files have <meta ...> tags to declare encodings, etc. So if a user says (say) "what if I want to output non-ASCII in the headers?", your response is supposed to be "we don't let you do that because that's actually wrong". Contrast this with platform I/O where the library is supposed to work transparently without any knowledge of any encoding (or lack thereof) for the data it deals with, because that's a higher-level concern and you don't expect the library to impose artificial constraints of its own.

naniwaduni · on Jan 24, 2021

"If I read a book as Russian, I want it to error if it contains French, non-Russian contents because the decoding failed. Any other way pushes the error down later into your system to readers that assume a Russian passage contains Russian but it's actually arbitrary text. We did this in War and Peace and it was a nightmare."

orf · on Jan 24, 2021

“If I expect a delivery of war and peace in English, I want it to error if I actually receive a stone tablet containing Neanderthal cave paintings thrown through my window at night”. They are two very different things, even if they both contain some form of information.

naniwaduni · on Jan 24, 2021

You are engaged in some deep magical thinking about what encodings, to believe that knowing the encoding of a so-called string allows you to perform any operations on it more correctly than on sack of bytes. (Fewer, in fact—at least the length of a byte array has any meaning at all.)

It's an easy but very much confused mistake to make if the text you work with is limited to European languages and Chinese.

orf · on Jan 24, 2021

> You are engaged in some deep magical thinking about what encodings, to believe that knowing the encoding of a so-called string allows you to perform any operations on it more correctly than on sack of bytes.

Not really. How would “.toupper()” work on a raw set of bytes, which would either contain an MP3 file or UTF8 encoded text?

Every single operation on a string-that-might-not-be-a-string-really would have to be fallible, which is a terrible interface to have for the happy path.

How would slicing work? I want the first 4 characters of a given string. That’s completely meaningless without an encoding (not that it means much with it).

How would concatenation work? I’m not saying Python does this, but concatenation two graphemes together doesn’t necessarily create a string with len() == 2.

How would “.startswith()” work with regards to grapheme clusters?

Text is different from bytes. There’s extra meaning and information attached to an arbitrary stream of 1s and 0s that allows you to do things you wouldn’t have been able to do if your base type is “just bytes”.

Sure you could make all of these return garbage if your “string” is actually an mp3 file, aka the JavaScript way, but... why?

naniwaduni · on Jan 24, 2021

> Not really. How would “.toupper()” work on a raw set of bytes, which would either contain an MP3 file or UTF8 encoded text?

It doesn't. It doesn't work with Unicode either. No, not "would need giant tables", literally doesn't work—you need to know whether your text is Turkish.

> How would slicing work? I want the first 4 characters of a given string. That’s completely meaningless without an encoding.

It's meaningless with an encoding: what are the first four characters of "áíúéó"? Do you expect "áí"? What are the first four characters of "ﷺ"? Trick question, that's one unicode codepoint.

At least with bytes you know that your result after slicing four bytes will fit in a 4-byte buffer.

> How would concatenation work? I’m not saying Python does this, but concatenation two graphemes together doesn’t necessarily create a string with len() == 2.

It doesn't work with Unicode either. I'm sure you've enjoyed the results of concatenating a string with an RTL marker with unsuspecting text.

It gets worse if we remember try to ascribe linguistic meaning to the text. What's the result of concatenating "ranch dips" with "hit singles"?

> How would “.startswith()” work with regards to grapheme clusters?

It doesn't. "🇨" is a prefix of "🇨🇦"; "i" is not a prefix of "ĳ".

> Text is different from bytes. There’s extra meaning and information attached to an arbitrary stream of 1s and 0s that allows you to do things you wouldn’t have been able to before.

None of the distinctions you're trying to make are tenable.

mannykannot · on Jan 24, 2021

It is not clear to me whether there is a material difference here. Any text string is a sequence of bytes for which some interpretation is intended, and many meaningful operations on those bytes will not be meaningful unless that interpretation is taken into account.

The problem that you have raised here seems to be one of what alphabet or language is being used, but that issue cannot even arise without taking the interpretation into account. If you want alphabet-aware, language-aware, spelling-aware or grammar-aware operators, these will all have to be layered on top of merely byte-aware operations, and this cannot be done without taking into account the intended interpretation of the bytes sequence.

Note that it is not unusual to embed strings of one language within strings written in another. I do not suppose it would be surprising to see some French in a Russian-language War and Peace.

naniwaduni · on Jan 24, 2021

This implies that you should have types for every intended use of a text string. This is, in fact, a sensible approach, reasonably popular in languages with GADTs, even if a bit cumbersome to apply universally.

A type to specify encoding alone? Totally useless. You can just as well implement those operations on top of a byte string assuming the encoding and language &c., as you can implement those operations on top of a Unicode sequence assuming language and culture &c..

mannykannot · on Jan 24, 2021

To implement any of the above, while studiously avoiding anything making explicit the fact that the interpretation of the bytes as a sequence of glyphs is an intended, necessary and separable step on the way, would be bizzarre and tendentious.

I see you have been editing your post concurrently with my reply:

> You can just as well implement those operations on top of a byte string assuming the encoding and language &c., as you can implement those operations on top of a Unicode sequence assuming language and culture &c..

Of course you can (though maybe not "just as well"), but that does not mean it is the best way to do so, and certainly not that it is "totally useless" to implement the decoding as a separate step. Separation of concerns is a key aspect of software engineering.

naniwaduni · on Jan 24, 2021

> To implement any of the above, while studiously avoiding anything making explicit the fact that the interpretation of the bytes as a sequence of glyphs is an intended, necessary and separable step on the way, would be bizzarre and tendentious.

Codepoints are not glyphs. Nor are any useful operations generally performed on glyphs in the first place. Almost all interpretable operations you might want to do are better conceived of as operating as substrings of arbitrary length, rather than glyphs, and byte substrings do this better than unicode codepoint sequences anyway.

So I contest the position that interpreting bytes as a glyph sequence is a viable step at all.

mannykannot · on Jan 24, 2021

Fair enough, codepoints, but the issue remains the same: you keep asserting that it is pointless - harmful, actually - to make use of this one particular interpretation from the hierarchy that exists, without offering any valid justification for why this one particular interpretation must be avoided, while both lower-level and higher-level interpretations are useful (necessary, even.)

Going back to the post I originally replied to, how would going down to a bytes view avoid the problems you see?

naniwaduni · on Jan 24, 2021

Let me rephrase. Codepoints are even less useful than abstract glyphs, cf. https://manishearth.github.io/blog/2017/01/14/stop-ascribing... (I don't agree 100% with the write-up, and in particular I would say that working on EGCs is still just punting the problem one more layer without resolving it; see some of my other posts in this thread. But it makes an attempt at clarifying the issue here.)

The choice of the bytes view specifically is just that it's the most popular view from which you can achieve one specific primitive: figuring out how much space a (sub)string occupies in whatever representation you store it in. A byte length achieves this. Of course, a length in bits or in utf-32 code units also achieves this, but I've found it rather uncommon to use utf-32 as a transfer encoding. So we need at least one string type with this property.

Other than this one particular niche, a codepoint view doesn't do much worse at most tasks. But it adds a layer of complexity while also not actually solving any of the problems you'd want it to. In fact, it papers over many of them, making it less obvious that the problems are still there to a team of eurocentric developers ... up until emoji suddenly become popular.

Now, I can understand the appeal of making your immediate problems vanish and leaving it for your successors, but I hope we can agree that it's not in good taste.

mannykannot · on Jan 24, 2021

While all the facts in this post appear correct, they do not seem to me to amount to an argument either for the proposition that an implementation at the utf-8 level is uniquely harmful, or that a bytes-level approach avoids these problems.

For example, working with the utf-8 view does not somehow foreclose on knowing how much memory a (sub)string occupies, and it certainly does not follow that, because this involves regarding the string as a sequence of bytes, this is the only way to regard it.

For another, let's consider a point from the linked article: "One false assumption that’s often made is that code points are a single column wide. They’re not. They sometimes bunch up to form characters that fit in single “columns”. This is often dependent on the font, and if your application relies on this, you should be querying the font." How does taking a bytes view make this any less of a potential problem?

Is a team of eurocentric developers likely to do any better working with bytes? Their misconceptions would seem to be at a higher level of abstraction than either bytes or utf-8.

You are claiming that taking a utf-8 view is an additional layer of complexity, but how does it simplify things to do all your operations at the byte level? Using utf-8 is more complex than using ascii, but that is beside the point: we have left ascii behind and replaced it with other, more capable abstractions, and it is a universal principle of software engineering that we should make use of abstractions, because they simplify things. It is also quite widely acknowledged that the use of types reduces the scope for error (every high-level language uses them.)

naniwaduni · on Jan 24, 2021

The burden of proof is on showing that the unicode view is, in your words, a more capable abstraction. My thesis is that it is not. This is not because it necessarily does anything worse (though it does). It must simply do something better. If there were actually anything at all it did better—well, I still wouldn't necessarily want it as a default but it would be a defensible abstraction.

The heart of the matter is that a Unicode codepoint sequence view of a string has no real use case.

There is no "universal principle" that we use abstractions always, regardless of whether they fit the problem; that's cargo-culting. An abstraction that does no work is, ceteris paribus, worse than not having it at all.

mannykannot · on Jan 25, 2021

> The burden of proof is on showing that the unicode view is, in your words, a more capable abstraction. My thesis is that it is not.

The quote, as you presented it, leaves open the question: more capable than what? Well, there's no doubt about it if you go back to my original post: more capable than ascii. Up until now, as far as I can tell, your thesis has not been that unicode is less capable than ascii, but if that's what your argument hangs on, go ahead - make that case.

What your thesis has been, up to this point, is that manipulating text as bytes is better, to the extent that doing it as unicode is harmful.

> It must simply do something better. If there were actually anything at all it did better...

It is amusing that you mentioned the burden of proof earlier, because what you have completely avoided doing so far is justify your position that manipulating bytes is better - for example, you have not answered any of the questions I posed in my previous post.

> The heart of the matter is that a Unicode codepoint sequence view of a string has no real use case.

Here we have another assertion presented without justification.

> There is no "universal principle" that we use abstractions always, regardless of whether they fit the problem...

It is about as close as anthing gets to a universal principle in software engineering, and if you want to disagree on that, go ahead, I'm ready to defend that point of view.

>... that's cargo-culting.

How about presenting an actual argument, instead of this bullshit?

Furthermore, you could take that statement out of my previous post, and it would do nothing to support the thesis you had been pushing up to that point. You seem to be seeking anything in my words that you think you can argue against, without regard to relevance - but in doing so, you might be digging a deeper hole.

> An abstraction that does no work is, ceteris paribus, worse than not having it at all.

Your use of a Latin phrase does not alter the fact that you are still making unsubstantiated claims.

naniwaduni · on Jan 25, 2021

Put it this way: claim a use-case you believe the unicode view does better on than an array of bytes. Since you're making the positive claim, this should be easy.

I guarantee you there will be a quick counterexample to demonstrate that the claimed use-case is incorrect. There always is.

You may review the gish gallop in the other branch of this thread for inspiration.

mannykannot · on Jan 25, 2021

Now you are attempting a full-on burden-shifting approach, but the unsupported claims here are that a unicode view is "fundamentally wrong" and that the proper approach is to operate on raw bytes. You can start on correcting this omission by answering the questions I posed about your claims a couple of posts ago.

https://news.ycombinator.com/item?id=25895523

naniwaduni · on Jan 25, 2021

And you can start holding a position instead of claiming omission until the cows come home.

mannykannot · on Jan 25, 2021

It is highly apposite that you should mention a Gish gallop in your earlier post: Duane Gish's combination of rhetorical ploys intended to persuade uninformed people that evolution hasn't happened. It includes a combination of burden-shifting, not answering awkward questions, introducing non-sequiturs, and attempting to change the subject, all of which you have already been employing, and which you are now intimating that you intend to continue with.

A Gish gallop is not, of course, a sound way to arrive at any sort of truth. Anyone employing the technique is either unaware of that, or is being duplicitous (my guess is that Gish never really understood that it is bogus.)

Meanwhile, those questions are still waiting to be milked, so to speak...

https://news.ycombinator.com/item?id=25895523

gugagore · on Jan 24, 2021

If I recall, this is the solution: https://stackoverflow.com/a/27185688

I don't know why there isn't a sys.argvb as there is is.environb

Wowfunhappy · on Jan 24, 2021

Fwiw, the python3 version didn't run at all for me in Python 3.9.0 on Mac.

    "UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

Chris2048 · on Jan 24, 2021

> It worked. The ecosystem is on py3.

Primarily because they killed py2, not because they won people over with their Unicode approach.

Wowfunhappy · on Jan 24, 2021

I agree with this, but I would make one important tweak: make the new behavior opt-out, instead of opt-in, with a configuration file option for switching the default.

You're still breaking code by default this way, but no one would have trouble updating.

My concern is that, if you don't make the preferred behavior clear, a lot of people would simply never adopt it. I don't think that Python's userbase in particular is going to spend time reading documentation on best practices.

Blikkentrekker · on Jan 24, 2021

I do believe that such a trivial change would indeed be fine. If one can go to the effort of installing the new version, one can add one line in a configuration file to depend upon old behavior.

IshKebab · on Jan 24, 2021

Yeah even CMake got the bit right. CMake!!

ktpsns · on Jan 24, 2021

I think some modular approach could have solved the incompatiblity issue, such as "from future import ...". Shorthands could have been invented to define everything in a single line.

Perl5 has similar flags ("use strict"), and Racket brings it even further to define the whole fucking language of the rest of the file ("#lang racket/gui"). Having the language being choosable by the user is against the "zen of python", I guess. In other words: Such an attemp does not feel "pythonic".

orf · on Jan 24, 2021

For some syntactical sugar, like print-function, sure. But there are more fundamental changes that couldn’t be papered over.

jamiesonbecker · on Jan 24, 2021

Fundamental changes shouldn't have been papered over by calling the new language Python, as it was a fundamentally different language at that point.

orf · on Jan 24, 2021

No, it’s the same language but with different semantics around a specific type. That’s not a different language and code can co-exist with a bit of thought.

JasonFruit · on Jan 24, 2021

Every language goes through this at some point in its development: flaws that limit future development have to be fixed. Should every language rename itself and split its community at that point? That seems like an extreme response to a common problem.

Chris2048 · on Jan 24, 2021

> Should every language rename itself and split its community at that point?

yes. If breaking, fundamental changes are common, that's a problem.

JasonFruit · on Jan 24, 2021

That people can make an initial plan that is self-consistent, logical, and foresees and provides for all future use-cases is a basic tenet of waterfall-style development. The history of software engineering does not uphold that principle. Why would it be different for language designers?

Chris2048 · on Jan 24, 2021

Just because an unforeseen issue arises, doesn't mean you need to introduce a breaking change right away.

djur · on Jan 24, 2021

The "new language" is called Python 3.

a1369209993 · on Jan 27, 2021

Yes, yes it is. And, like "Perl 6" and to a lesser extent "C++", that name is misleading (and therefore bad), because there is already a different language called "Python" (respectively "Perl", "C"), with significant superficial similarities that it could be confused with.

lizmat · on Feb 1, 2021

Please note that the misleading part of Perl 6 has been fixed by renaming it to the Raku Programming Language (https://raku.org using the #rakulang tag on social media).

nicoburns · on Jan 24, 2021

> How would you have handled the string/bytes split in a way that’s backwards compatible?

My understanding is that the corresponding types are available in both 2 and 3, they're just named differently. A different one is "string". So you could have had some kind of mode directive at the top of the file which controlled which version that file was in, and allow files from 2 and 3 to be run together.

orf · on Jan 24, 2021

Actually think about it. bytes is str in Python 2. There is no bytes type in py2. How would a per-file directive (of all things) help?

What if one function running in “py2 mode” returned a string-that-is-actually-bytes, how would a function in “py3 mode” consume it? What would the type be? If different, how would it be detected or converted? What if it retuned a utf8 string OR bytes? What if that py3 function then passed it to a py2 function - would it become a string again? Would you have two string types - py2string that accepts anything and py3string that only works with utf8? How would this all work with C modules?

nicoburns · on Jan 24, 2021

> What if one function running in “py2 mode” returned a string-that-is-actually-bytes, how would a function in “py3 mode” consume it? What would the type be?

It would be bytes. Because py2 string === py3 bytes.

> What if that py3 function then passed it to a py2 function - would it become a string again?

Yes

> Would you have two string types - py2string that accepts anything and py3string that only works with utf8?

Yes. You already have those two types in python3. bytes and string. You'd just alias those as string and utf8 or whatever you want to call it in python2.

How would this all work with C modules?

They'd have specify which mode they were working with too.

orf · on Jan 24, 2021

But all this would require huge rewrites of code and would never be backward compatible. You’re trading “py2 vs py3” with “py2 mode vs py3 mode”.

So you’d have some magic code that switches py2str to bytes. Which means every py3 caller has to cast bytes into a string to do anything useful with it, because returning strings is the most common case. Then that code has to be removed when the code it’s calling is updated to py3 mode. Which is basically the blue/green issue you see with async functions but way, way worse.

Then you’d need to handle subclasses, wrappers of bytes/str, returning collections of strings across py2/py3 boundaries (would these be copies? Different types? How would type(value[0]) work?), ending up with mixed lists/dicts of bytes and strings depending on the function context, etc etc.

It would become an absolute complete clusterfuck of corner cases that would have killed the language outright.

nicoburns · on Jan 24, 2021

> You’re trading “py2 vs py3” with “py2 mode vs py3 mode”.

Yes, that's the whole point. Because compatible modes allow for a gradual transition. Which in practice allows for a much faster transition, because you don't have to transition everything at once (which puts some people off transitioning entirely - making things infinitely harder for everyone else).

Languages like Rust (editions) and JavaScript (strict mode) have done this successfully and relatively painlessly.

> So you’d have some magic code that switches py2str to bytes. Which means every py3 caller has to cast bytes into a string to do anything useful with it, because returning strings is the most common case. Then that code has to be removed when the code it’s calling is updated to py3 mode. Which is basically the blue/green issue you see with async functions but way, way worse.

Well yes, you'd still have to upgrade your code. That goes with a major version bump. But it would allow you to do it on a library-by-library basis rather than forcing you to wait until every dependency has a v3 version. Have that one dependency that keeping you stuck on v2: no problem, upgrade everything else and wrap that one lib in conversion code.

> Then you’d need to handle subclasses, wrappers of bytes/str, returning collections of strings across py2/py3 boundaries (would these be copies? Different types? How would type(value[0]) work?), ending up with mixed lists/dicts of bytes and strings depending on the function context, etc etc.

I'm not sure I understand the problem here. The types themselves are the same between python 2 and 3 (or could have been). It's just the labels that refer to them that are different. A subclass of string in python 2 code would just be a subclass of bytes in python 3 code.

musicale · on Jan 26, 2021

We have lots of existence proofs of languages evolving gracefully and not throwing old code off a cliff.

Python 3 made the wrong trade-off of core developer hours vs. external developer hours.

Nullabillity · on Jan 24, 2021

py2 str == py3 bytes

py2 unicode == py3 str

The problem with this approach is that they wanted to reuse the `str` name, which requires a big "flag day", where it switches meaning and compatibility is effectively impossible across that boundary (without ugly hacks).

What they could have done instead would have been to just rename `str` to `bytes`, but retain a deprecated `str` alias that pointed to `bytes`.

That would keep old scripts running indefinitely, while hopefully spewing enough warnings that any maintained libraries and scripts would make the transition.

Eventually they could remove `str` entirely (though I'd personally be against it), but that would still give an actual transition period where everything would be seamlessly compatible.

Same thing with literals: deprecate bare strings, and transition to having to pick explicitly between `b"foo"` and `u"foo"`. Eventually consider removing bare strings entirely. DO NOT just change the meaning of bare strings while removing the ability to pick the default explicitly (in contrast, 3.0 removed `u"asdf"`, and it was only reintroduced several versions later).

What made me personally lose faith in the Python Core team wasn't that Guido made an old mistake a long time ago. It wasn't that they wanted to fix it. It was the absolutely bone-headed way that they prioritized aesthetics over the migration story.

falcor84 · on Jan 24, 2021

>Would you have two string types - py2string that accepts anything and py3string that only works with utf8? Yes. A single naive one for py2 and two separate ones for py3 bytes and unicode. All casting between the two would have to be made explicit.

> How would this all work with C modules? In non-strict mode, you'd be able to use either py2 strings or py3 bytes with these, and gradually move all modules to strict mode which requires bytes.

And then, gradually after a decade or so attempt to get rid of all py2 types.

musicale · on Jan 24, 2021

> How would you have handled the string/bytes split in a way that’s backwards compatible? Or the removal of old-style classes?

I'm not sure it's the best way to handle it, but I would have been fine with:

    from __python2__ import *

for full backward compatibility; or, more explicitly:

    from __python2__ import ascii_strings, old_style_classes, print_statement, ...

As the parent poster mentions, several other popular languages and systems (C++, Java, etc.) have done a pretty decent job preserving backward compatibility, for good reason: it saves millions of hours of human effort. It's embarrassing and disappointing that Python simply blew it with the Python 2 to 3 transition.

Maybe we could still evolve pypi to support a compatibility layer to allow easy mixing of python2 and python3 code, but I get the feeling that Python 3 has poisoned the well.

imtringued · on Jan 24, 2021

When I was learning Python 6 years ago I was the only one using Python 3 in my group because I use arch linux. It was very basic code and everyone basically solved the same problem. Everyone else's code didn't work on my machine because print is not a statement in Python 3.

That's just plain stupid. Just print a warning and add a python2 flag that hides the warning. Don't release a major version because of something trivial like this.

orf · on Jan 24, 2021

They didn’t release A major version because they changed print from a statement to a function.

TheGoddessInari · on Jan 24, 2021

Python gave everyone 12 years to deal with version 3 being the way forward. There are many fundamental changes.

The fact that people seem to complain exclusively after Python 2' end of life a year ago feels a little telling. Perl's community roffle stomped their previous vision for Perl 6. Python community wasn't vocal about this being a bad change. Rather the opposite, very loud support.

Keep in mind, I dislike Python either way, but I'm not one of the devs that complains about continuing education requirements, or language adding things over each 10 year period. I can work in Python just fine, but that doesn't mean it feels nice & hygienic to use for me personally.