
> It worked. The ecosystem is on py3.

Because they're doing everything they can to force py2 to go away. It's not that it's dying a natural death out of disuse. Exhibit A is everyone else in this post still wanting to use it.

If you think strings "work" under py3, my guess is you've never had to deal with all the edge cases, especially across all three major desktop platforms. Possibly because your applications are limited in scope. (You're definitely not writing general-purpose libraries that guarantee correctness for a wide variety of usage.) Most things Python treats as Unicode text by default (file contents, file paths, command-line arguments, stdio streams, etc.) are not guaranteed to contain only valid Unicode. They can have invalid Unicode mixed into them, either accidentally or intentionally, breaking programs needlessly.

As a small example, try these and compare:

  python2 -c "import sys; print('Your input was:'); print(sys.argv[1])" $'\x80' | xxd
  python3 -c "import sys; print('Your input was:'); print(sys.argv[1])" $'\x80' | xxd
This program is content-agnostic (like `cat`, `printf`, etc.), and hence, with a decent standard library implementation, you would expect it to be able to pass arbitrary data through just fine. But it doesn't, because Python insists on treating arguments as Unicode strings rather than as raw data, and it behaves worse on Python 3 than on Python 2. You really have to go out of your way to make it work correctly, and the solution is often pretty much to just ditch strings in many places and deal with bytes as much as possible... i.e., you realize Unicode strings were the wrong data type. But since you're still forced to deal with them in some ways, you get the worst of both worlds: the complexity increases dramatically and it becomes increasingly painful to ensure your program still works correctly as it evolves.
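
(The least-bad workaround I've found is exactly that "ditch strings" move: turn the surrogate-escaped argument back into its original bytes and write to the binary layer of stdout. A rough POSIX-only sketch; Windows argv handling is a different story:

  import os, sys
  # round-trip the surrogate-escaped argv entry back to its original bytes
  # and bypass the text wrapper around stdout entirely
  sys.stdout.buffer.write(b"Your input was:\n")
  sys.stdout.buffer.write(os.fsencode(sys.argv[1]) + b"\n")

Note how little of it still looks like "just print the argument".)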

I say all this because I've run into these issues and dealt with them, and it's become clear to me that others who love Unicode strings just haven't gone very far in trying to use them. Often this seems to be because they (a) are writing limited-scope programs rather than libraries, (b) confine themselves to nice, sanitized systems & inputs, and/or (c) take an "out-of-sight -> out-of-mind" attitude towards issues that don't immediately crop up on their systems & inputs.



> You're definitely not writing general-purpose libraries that guarantee correctness for a wide variety of usage.

At the risk of sounding like a dick, I’m a member of the Django technical board and have been involved with its development for quite a while. Is that widely used or general purpose enough?

If you want a string then it needs to be a valid string with a known encoding (not necessarily utf8). If you want to pass through any data regardless of its contents then you use bytes. They are two very different things with very different use cases.

If I read a file as utf8, I want it to error because the decoding failed if it contains garbage, non-text content. Any other way pushes the error further down into your system, to places that assume a string contains text when it's actually arbitrary bytes. We did this in py2 and it was a nightmare.
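
To make it concrete, a throwaway sketch of the difference:

  raw = b"caf\xe9"         # latin-1 bytes pretending to be text
  raw.decode("utf-8")      # raises UnicodeDecodeError right here, at the boundary,
                           # instead of three layers down in code that assumes str means text

That early, loud failure is the point.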

I concede that it’s convenient to ignore the difference in some circumstances, but differentiating between bytes/str has a lot of advantages and makes Python code more resilient and easier to read.


> I’m a member of the Django technical board and have been involved with its development for quite a while. Is that widely used or general purpose enough?

That's not quite what I was saying here. Note I said "wide variety of usage", not "widely used". Django is a web development framework, and its purpose is very clear and specific: to build a web app. Crucially, a web framework knows what its encoding constraints are at its boundaries, and it is supposed to enforce them. For example, HTTP headers are known to be ASCII, HTML files have <meta ...> tags to declare encodings, etc. So if a user says (say) "what if I want to output non-ASCII in the headers?", your response is supposed to be "we don't let you do that because that's actually wrong". Contrast this with platform I/O, where the library is supposed to work transparently without any knowledge of any encoding (or lack thereof) for the data it deals with, because that's a higher-level concern and you don't expect the library to impose artificial constraints of its own.
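
(Python itself half-concedes this: the bytes-based variants of the platform APIs pass data through untouched. A rough sketch:

  import os
  # bytes in, bytes out: nothing is decoded, nothing is assumed about the names
  names = os.listdir(b".")   # e.g. [b'notes.txt', b'caf\xc3\xa9', b'\x80broken']

It's only the default str-everywhere layer on top that imposes the constraint.)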


"If I read a book as Russian, I want it to error if it contains French, non-Russian contents because the decoding failed. Any other way pushes the error down later into your system to readers that assume a Russian passage contains Russian but it's actually arbitrary text. We did this in War and Peace and it was a nightmare."


“If I expect a delivery of war and peace in English, I want it to error if I actually receive a stone tablet containing Neanderthal cave paintings thrown through my window at night”. They are two very different things, even if they both contain some form of information.


You are engaged in some deep magical thinking about encodings if you believe that knowing the encoding of a so-called string allows you to perform any operations on it more correctly than on a sack of bytes. (Fewer, in fact; at least the length of a byte array has some meaning at all.)

It's an easy mistake to make, but a deeply confused one, if the text you work with is limited to European languages and Chinese.


> You are engaged in some deep magical thinking about encodings if you believe that knowing the encoding of a so-called string allows you to perform any operations on it more correctly than on a sack of bytes.

Not really. How would “.toupper()” work on a raw set of bytes, which could contain either an MP3 file or UTF-8 encoded text?

Every single operation on a string-that-might-not-be-a-string-really would have to be fallible, which is a terrible interface to have for the happy path.

How would slicing work? I want the first 4 characters of a given string. That’s completely meaningless without an encoding (not that it means much with it).

How would concatenation work? I’m not saying Python does this, but concatenating two graphemes together doesn’t necessarily create a string with len() == 2.

How would “.startswith()” work with regards to grapheme clusters?

Text is different from bytes. There’s extra meaning and information attached to an arbitrary stream of 1s and 0s that allows you to do things you wouldn’t have been able to do if your base type is “just bytes”.

Sure you could make all of these return garbage if your “string” is actually an mp3 file, aka the JavaScript way, but... why?


> Not really. How would “.toupper()” work on a raw set of bytes, which could contain either an MP3 file or UTF-8 encoded text?

It doesn't. It doesn't work with Unicode either. No, not "would need giant tables", literally doesn't work—you need to know whether your text is Turkish.
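
(Concretely, with Python's locale-unaware case mapping:

  "i".upper()      # 'I'   -- wrong for Turkish, which uppercases i to 'İ'
  "KIM".lower()    # 'kim' -- also wrong for Turkish, where I lowercases to 'ı'

and no amount of knowing "this is UTF-8" tells you which answer you wanted.)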

> How would slicing work? I want the first 4 characters of a given string. That’s completely meaningless without an encoding.

It's meaningless with an encoding: what are the first four characters of "áíúéó" (written with combining accents)? Do you expect "áí"? What are the first four characters of "ﷺ"? Trick question, that's one Unicode codepoint.

At least with bytes you know that your result after slicing four bytes will fit in a 4-byte buffer.
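
(In Python terms, assuming the decomposed spelling:

  s = "a\u0301i\u0301u\u0301e\u0301o\u0301"   # "áíúéó" with combining acute accents
  s[:4]                                       # 'a\u0301i\u0301' -- renders as "áí"
  len("\ufdfa")                               # 1 -- "ﷺ" is a single codepoint

"the first four characters" already went wrong before bytes ever entered the picture.)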

> How would concatenation work? I’m not saying Python does this, but concatenating two graphemes together doesn’t necessarily create a string with len() == 2.

It doesn't work with Unicode either. I'm sure you've enjoyed the results of concatenating a string containing an RTL marker onto unsuspecting text.

It gets worse if we try to ascribe linguistic meaning to the text. What's the result of concatenating "ranch dips" with "hit singles"?

> How would “.startswith()” work with regards to grapheme clusters?

It doesn't. "🇨" is a prefix of "🇨🇦"; "i" is not a prefix of "ij".
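
(Concretely:

  "\U0001f1e8\U0001f1e6".startswith("\U0001f1e8")   # True: a lone "🇨" counts as a prefix of "🇨🇦"
  "e\u0301".startswith("e")                         # True: "e" counts as a prefix of decomposed "é"

codepoint-level prefixes simply don't line up with what's on the screen.)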

> Text is different from bytes. There’s extra meaning and information attached to an arbitrary stream of 1s and 0s that allows you to do things you wouldn’t have been able to before.

None of the distinctions you're trying to make are tenable.


It is not clear to me whether there is a material difference here. Any text string is a sequence of bytes for which some interpretation is intended, and many meaningful operations on those bytes will not be meaningful unless that interpretation is taken into account.

The problem that you have raised here seems to be one of what alphabet or language is being used, but that issue cannot even arise without taking the interpretation into account. If you want alphabet-aware, language-aware, spelling-aware or grammar-aware operators, these will all have to be layered on top of merely byte-aware operations, and this cannot be done without taking into account the intended interpretation of the bytes sequence.

Note that it is not unusual to embed strings of one language within strings written in another. I do not suppose it would be surprising to see some French in a Russian-language War and Peace.


This implies that you should have types for every intended use of a text string. This is, in fact, a sensible approach, reasonably popular in languages with GADTs, even if a bit cumbersome to apply universally.
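
(In Python terms the idea looks roughly like this, with made-up names, each alias marking one intended use and constructed only at the boundary where its invariant is checked:

  from typing import NewType

  # one type per intended use of "text"; the names are purely illustrative
  Utf8Text = NewType("Utf8Text", str)
  HtmlSafe = NewType("HtmlSafe", str)
  ShellArg = NewType("ShellArg", bytes)

Cumbersome, as noted, but at least the distinctions carry real information.)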

A type to specify encoding alone? Totally useless. You can just as well implement those operations on top of a byte string assuming the encoding and language &c., as you can implement those operations on top of a Unicode sequence assuming language and culture &c.


To implement any of the above, while studiously avoiding anything making explicit the fact that the interpretation of the bytes as a sequence of glyphs is an intended, necessary and separable step on the way, would be bizarre and tendentious.

I see you have been editing your post concurrently with my reply:

> You can just as well implement those operations on top of a byte string assuming the encoding and language &c., as you can implement those operations on top of a Unicode sequence assuming language and culture &c.

Of course you can (though maybe not "just as well"), but that does not mean it is the best way to do so, and certainly not that it is "totally useless" to implement the decoding as a separate step. Separation of concerns is a key aspect of software engineering.


> To implement any of the above, while studiously avoiding anything making explicit the fact that the interpretation of the bytes as a sequence of glyphs is an intended, necessary and separable step on the way, would be bizarre and tendentious.

Codepoints are not glyphs. Nor are any useful operations generally performed on glyphs in the first place. Almost all interpretable operations you might want to do are better conceived of as operating on substrings of arbitrary length, rather than on glyphs, and byte substrings do this better than Unicode codepoint sequences anyway.

So I contest the position that interpreting bytes as a glyph sequence is a viable step at all.


Fair enough, codepoints; but the issue remains the same: you keep asserting that it is pointless - harmful, actually - to make use of this one particular interpretation from the hierarchy that exists, without offering any valid justification for why it must be avoided while both lower-level and higher-level interpretations are useful (necessary, even).

Going back to the post I originally replied to, how would going down to a bytes view avoid the problems you see?


Let me rephrase. Codepoints are even less useful than abstract glyphs, cf. https://manishearth.github.io/blog/2017/01/14/stop-ascribing... (I don't agree 100% with the write-up, and in particular I would say that working on EGCs is still just punting the problem one more layer without resolving it; see some of my other posts in this thread. But it makes an attempt at clarifying the issue here.)

The choice of the bytes view specifically is just that it's the most popular view from which you can achieve one specific primitive: figuring out how much space a (sub)string occupies in whatever representation you store it in. A byte length achieves this. Of course, a length in bits or in utf-32 code units also achieves this, but I've found it rather uncommon to use utf-32 as a transfer encoding. So we need at least one string type with this property.
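
(The primitive in question, in Python terms:

  len("café")                    # 4 codepoints: not a storage size, not a display width
  len("café".encode("utf-8"))    # 5 bytes: the number that actually sizes a buffer

The codepoint count answers neither the storage question nor the rendering question.)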

Other than this one particular niche, a codepoint view doesn't do much worse at most tasks. But it adds a layer of complexity while also not actually solving any of the problems you'd want it to. In fact, it papers over many of them, making it less obvious that the problems are still there to a team of eurocentric developers ... up until emoji suddenly become popular.

Now, I can understand the appeal of making your immediate problems vanish and leaving it for your successors, but I hope we can agree that it's not in good taste.


While all the facts in this post appear correct, they do not seem to me to amount to an argument either for the proposition that an implementation at the utf-8 level is uniquely harmful, or that a bytes-level approach avoids these problems.

For example, working with the utf-8 view does not somehow foreclose on knowing how much memory a (sub)string occupies, and it certainly does not follow that, because this involves regarding the string as a sequence of bytes, this is the only way to regard it.

For another, let's consider a point from the linked article: "One false assumption that’s often made is that code points are a single column wide. They’re not. They sometimes bunch up to form characters that fit in single “columns”. This is often dependent on the font, and if your application relies on this, you should be querying the font." How does taking a bytes view make this any less of a potential problem?

Is a team of eurocentric developers likely to do any better working with bytes? Their misconceptions would seem to be at a higher level of abstraction than either bytes or utf-8.

You are claiming that taking a utf-8 view is an additional layer of complexity, but how does it simplify things to do all your operations at the byte level? Using utf-8 is more complex than using ascii, but that is beside the point: we have left ascii behind and replaced it with other, more capable abstractions, and it is a universal principle of software engineering that we should make use of abstractions, because they simplify things. It is also quite widely acknowledged that the use of types reduces the scope for error (every high-level language uses them.)


The burden of proof is on showing that the unicode view is, in your words, a more capable abstraction. My thesis is that it is not. This is not because it necessarily does anything worse (though it does). It must simply do something better. If there were actually anything at all it did better—well, I still wouldn't necessarily want it as a default but it would be a defensible abstraction.

The heart of the matter is that a Unicode codepoint sequence view of a string has no real use case.

There is no "universal principle" that we use abstractions always, regardless of whether they fit the problem; that's cargo-culting. An abstraction that does no work is, ceteris paribus, worse than not having it at all.


> The burden of proof is on showing that the unicode view is, in your words, a more capable abstraction. My thesis is that it is not.

The quote, as you presented it, leaves open the question: more capable than what? Well, there's no doubt about it if you go back to my original post: more capable than ascii. Up until now, as far as I can tell, your thesis has not been that unicode is less capable than ascii, but if that's what your argument hangs on, go ahead - make that case.

What your thesis has been, up to this point, is that manipulating text as bytes is better, to the extent that doing it as unicode is harmful.

> It must simply do something better. If there were actually anything at all it did better...

It is amusing that you mentioned the burden of proof earlier, because what you have completely avoided doing so far is justify your position that manipulating bytes is better - for example, you have not answered any of the questions I posed in my previous post.

> The heart of the matter is that a Unicode codepoint sequence view of a string has no real use case.

Here we have another assertion presented without justification.

> There is no "universal principle" that we use abstractions always, regardless of whether they fit the problem...

It is about as close as anything gets to a universal principle in software engineering, and if you want to disagree on that, go ahead, I'm ready to defend that point of view.

>... that's cargo-culting.

How about presenting an actual argument, instead of this bullshit?

Furthermore, you could take that statement out of my previous post, and it would do nothing to support the thesis you had been pushing up to that point. You seem to be seeking anything in my words that you think you can argue against, without regard to relevance - but in doing so, you might be digging a deeper hole.

> An abstraction that does no work is, ceteris paribus, worse than not having it at all.

Your use of a Latin phrase does not alter the fact that you are still making unsubstantiated claims.


Put it this way: claim a use-case you believe the unicode view does better on than an array of bytes. Since you're making the positive claim, this should be easy.

I guarantee you there will be a quick counterexample to demonstrate that the claimed use-case is incorrect. There always is.

You may review the gish gallop in the other branch of this thread for inspiration.


Now you are attempting a full-on burden-shifting approach, but the unsupported claims here are that a unicode view is "fundamentally wrong" and that the proper approach is to operate on raw bytes. You can start on correcting this omission by answering the questions I posed about your claims a couple of posts ago.

https://news.ycombinator.com/item?id=25895523


And you can start holding a position instead of claiming omission until the cows come home.


It is highly apposite that you should mention a Gish gallop in your earlier post: Duane Gish's combination of rhetorical ploys intended to persuade uninformed people that evolution hasn't happened. It includes a combination of burden-shifting, not answering awkward questions, introducing non-sequiturs, and attempting to change the subject, all of which you have already been employing, and which you are now intimating that you intend to continue with.

A Gish gallop is not, of course, a sound way to arrive at any sort of truth. Anyone employing the technique is either unaware of that, or is being duplicitous (my guess is that Gish never really understood that it is bogus.)

Meanwhile, those questions are still waiting to be milked, so to speak...

https://news.ycombinator.com/item?id=25895523


If I recall, this is the solution: https://stackoverflow.com/a/27185688

I don't know why there isn't a sys.argvb as there is os.environb.
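
(For comparison, os.environb hands you the environment as raw bytes with no decoding step, POSIX only:

  import os
  os.environb.get(b"PATH")    # raw bytes, no surrogates, nothing assumed about encoding

so the precedent for a bytes-level view of process inputs is already there.)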


Fwiw, the python3 version didn't run at all for me in Python 3.9.0 on Mac.

    "UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed



