Memory use: Unicode scalar values go up to 0x10ffff, which on most machines mean...

paulddraper · on June 17, 2017

On the contrary, UTF-8 is the one that is long, up to 50% longer than UTF-32. (Unless you happen to have a disproportionate number of low code points.)

No free lunches!

MrManatee · on June 18, 2017

That's UTF-16, not UTF-32.

UTF-8 is one to four bytes, UTF-16 is two or four bytes, and UTF-32 is always four bytes. For some code points, UTF-8 is 50% longer than UTF-16 (3 vs 2), but UTF-8 is never longer than UTF-32.
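
For anyone who wants to check those sizes, here is a minimal Python sketch. The helper name `encoded_lengths` and the sample characters are made up for illustration. It uses the `-le` codec variants so Python does not prepend a byte order mark to the UTF-16 and UTF-32 output.

```python
def encoded_lengths(ch):
    """Return the byte length of one character in UTF-8, UTF-16, and UTF-32."""
    # The "-le" codecs fix the byte order, so no byte order mark is added
    # and the lengths reflect the code units alone.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


# One sample per UTF-8 length class: 1, 2, 3, and 4 bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    u8, u16, u32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: utf-8={u8} utf-16={u16} utf-32={u32}")
```

The three-byte case (U+20AC here) is the one where UTF-8 is 50% longer than UTF-16. In no row does UTF-8 exceed the four bytes of UTF-32.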

panic · on June 17, 2017

Sure, UTF-8 isn't always the shortest, but for many common strings (like JSON-encoded objects with ASCII keys) it is much shorter than UTF-32. The point is that using a list representation means you can't do any better than UTF-32, even if you wanted to.
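
A rough way to see the difference on a JSON-style string, as a sketch only: the `record` contents below are invented for illustration, and a fixed 4 bytes per code point stands in for the best case of a list-of-code-points representation (a real linked list would add per-node overhead on top of that).

```python
import json

# Hypothetical record: ASCII keys, one non-ASCII character in a value.
record = {"id": 42, "name": "Zo\u00eb", "tags": ["a", "b"]}

# ensure_ascii=False keeps the non-ASCII character as-is instead of
# escaping it to \u00eb, so the comparison reflects the real code points.
text = json.dumps(record, ensure_ascii=False)

utf8_size = len(text.encode("utf-8"))
# "utf-32-le" avoids the byte order mark; every code point takes 4 bytes.
utf32_size = len(text.encode("utf-32-le"))

print(f"code points: {len(text)}")
print(f"UTF-8 bytes: {utf8_size}")
print(f"UTF-32 bytes: {utf32_size}")
```

Because almost every character here is ASCII, the UTF-8 size is close to one byte per code point, while the UTF-32 size is exactly four times the code point count.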

paulddraper · on June 17, 2017

If you have ASCII, might I recommend the ASCII character set and encoding?