
Why should the OS mess with application data? I think syscalls should treat text as the blob it is and not care about the encoding at all.


A file name is a string, not a blob.


Yes, and my argument is that the OS should treat strings as blobs and not care about the encoding. How can it know what shiny new encoding the program uses? Encoding is a concern of the program; the OS should just leave it alone and not try to decode it.


The OS treats strings as a blob, yes, but typically specifies that they're a blob of nul-terminated data.

Unfortunately, some text encodings (UTF-16 among them) produce nul bytes for code points other than U+0000. In fact, UTF-16 emits a nul byte for every character below U+0100, in other words all of ASCII and Latin-1. Therefore you can't just support _all_ text encodings for filenames on these OSes, unless the OS provides a second syscall for it (this is what Windows did, since they wanted to use UTF-16LE across the board).
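
To make that concrete, here is a small Python sketch (the file name "a.txt" is just an illustrative value, and `c_view` only imitates what a nul-terminated C API would see, it is not a real syscall):

```python
# Encode a plain ASCII file name two ways and look for nul bytes.
name = "a.txt"
utf16 = name.encode("utf-16-le")
utf8 = name.encode("utf-8")

print(utf16)             # b'a\x00.\x00t\x00x\x00t\x00'
print(b"\x00" in utf16)  # True: every ASCII character carries a zero high byte
print(b"\x00" in utf8)   # False: UTF-8 only emits 0x00 for U+0000 itself

# Imitation of a nul-terminated consumer: it stops at the first nul byte.
c_view = utf16.split(b"\x00", 1)[0]
print(c_view)            # b'a'
```

So a UTF-16 name handed to an API that expects nul termination would be cut off after the first byte pair.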

I've only mentioned syscalls here, but in truth this extends all through the C stdlib, which everything ends up using in some way as well.


You should not be passing file names in different encodings because other apps won't be able to display them properly. There should be one standard encoding for file names. It would also help with things like looking up a name ignoring case and extra spaces.


I mean, I agree there _should_ be one standard encoding, but the Unix API (to pick the example I'm closest to) predates these nuances. All it says is that filenames are a string [of bytes] and can't contain the bytes '/' or '\0'.
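
As a rough sketch of that rule in Python (`is_valid_unix_component` is a hypothetical helper, not a real API; it checks only the two forbidden bytes plus the assumption that a name is non-empty, and ignores length limits and the special names "." and ".."):

```python
def is_valid_unix_component(name: bytes) -> bool:
    """Hypothetical check: a single path component is any non-empty
    byte string that contains neither b'/' nor a nul byte."""
    return len(name) > 0 and b"/" not in name and b"\x00" not in name


# Valid UTF-8 is fine, but so are bytes that are not valid UTF-8 at all.
print(is_valid_unix_component("café.txt".encode("utf-8")))  # True
print(is_valid_unix_component(b"\xff\xfe.txt"))             # True
print(is_valid_unix_component(b"a/b"))                      # False
print(is_valid_unix_component(b"a\x00b"))                   # False
```

Nothing in the check looks at an encoding, which is the point: the API only sees bytes.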

It is good for an implementation to enforce this at some level, sure. macOS has proved that features like case insensitivity and Unicode normalization can be integrated with Unix filename APIs.
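
For anyone wondering why normalization matters here, a quick Python illustration (this only shows the general problem using the standard `unicodedata` module; it makes no claim about how macOS implements it, and `same_name` is a made-up helper):

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as one code point, U+00E9
decomposed = "cafe\u0301"   # 'e' followed by combining acute accent, U+0301

# Same text on screen, different bytes on disk.
print(composed.encode("utf-8"))    # b'caf\xc3\xa9'
print(decomposed.encode("utf-8"))  # b'cafe\xcc\x81'
print(composed == decomposed)      # False


def same_name(a: str, b: str) -> bool:
    """Hypothetical comparison: normalize to NFC, then case-fold."""
    def key(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()
    return key(a) == key(b)


print(same_name(composed, decomposed))        # True
print(same_name("README.TXT", "readme.txt"))  # True
```

A byte-for-byte filesystem would treat the two spellings of "café" as different files; one that normalizes and case-folds would not.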


You're right, I missed that. Sounds like blob size should be communicated out of band.
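
Something like this, as a toy Python sketch (both function names are made up for illustration; "out of band" here just means the length travels as a separate argument instead of being implied by a terminator):

```python
def nul_terminated_view(buf: bytes) -> bytes:
    """What a consumer sees if it stops at the first nul byte."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]


def counted_view(buf: bytes, length: int) -> bytes:
    """What a consumer sees if the length is passed alongside the buffer."""
    return buf[:length]


name = "a.txt".encode("utf-16-le")

print(nul_terminated_view(name))      # b'a'
print(counted_view(name, len(name)))  # b'a\x00.\x00t\x00x\x00t\x00'
```

With an explicit length, embedded nul bytes no longer truncate the name, so the encoding stops mattering to the transport.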


A file name is not a blob, because it is entered by the user as a text string and displayed for the user as a text string, not as a bunch of hex digits. Also, it cannot contain some characters (like slash or nul), so it's not a blob anyway.

And you should be using one specified encoding for file names if you want them to be displayed correctly in all applications. It would be inconvenient if different applications stored file names in different encodings.

For the same reason, the encoding should be specified in library documentation for all functions accepting or returning strings.



