
Hit the link expecting to read about UTF-8 Byte Order Marks at the top of the file, so that the first few bytes aren't actually #! but 0xEF 0xBB 0xBF #! instead. Ran into this one just a few months ago when a coworker who uses Windows had checked a Bash script into the Git repo. His editor was configured to save files as "UTF-8 with BOM" and so we were getting errors that looked like "./doit.sh: line 1: #!/bin/bash: No such file or directory". Can you see the invisible BOMb in that line? It's there, I promise you.

That's not what the article was actually about, as it turned out. The surprise in the article was about relative paths for script shebang lines. Which was useful to learn about, of course, but I was actually surprised by the surprise.





UTF-8 doesn't have a BOM. UTF-16 does. UTF-32 does. "UTF-8 with BOM" is not a standards-compliant text format, it's a proprietary binary format that happens to have a bunch of embedded UTF-8. Just because you can run `strings` on a file & get a bunch of text out doesn't mean it's a text file!

This seems a bit pedantic. You may be correct (I honestly don't know which standard this refers to), but the UTF-8 BOM is a thing that some tools do know about. Even then, in the context of OP's question, the BOM with UTF-8 isn't the specific problem. The problem is how the shebang interpreter reads the actual ASCII byte sequence, so a UTF-16 "text" file with a BOM would also fail.

tbh it is lame for any program reading a text file to not support BOM. It's just one if.
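For what it's worth, the "one if" could look roughly like the Python sketch below. The function name `strip_utf8_bom` is made up for illustration; `codecs.BOM_UTF8` is the standard library constant for the bytes 0xEF 0xBB 0xBF.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for illustration; not taken from any particular tool.
    """
    # The "one if": drop the three BOM bytes when the input starts with them.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    script = codecs.BOM_UTF8 + b"#!/bin/bash\necho hi\n"
    print(strip_utf8_bom(script)[:2])
```

This only covers a program that reads the file itself. It does nothing for the case where the kernel is the one inspecting the first bytes.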

There isn't really any one "text file" though, the kernel looks for the first two bytes to match what "#!" corresponds to in ASCII.
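To illustrate, here is a small Python sketch that imitates that two-byte comparison. It is not the kernel's actual code, and `starts_with_shebang` is a made-up name; it only shows why any leading bytes, BOM or otherwise, make the match fail.

```python
SHEBANG = b"#!"  # ASCII bytes 0x23 0x21


def starts_with_shebang(data: bytes) -> bool:
    """Imitate a check of the first two bytes against "#!".

    Hypothetical helper; a sketch of the idea, not how any kernel implements it.
    """
    # Compare raw bytes only. No decoding and no BOM handling takes place.
    return data[:2] == SHEBANG


if __name__ == "__main__":
    plain = b"#!/bin/bash\n"
    with_bom = b"\xef\xbb\xbf#!/bin/bash\n"
    print(starts_with_shebang(plain))
    print(starts_with_shebang(with_bom))
```

With the BOM in front, the first two bytes are 0xEF 0xBB, so the comparison fails even though the file looks identical in most editors.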

https://www.youtube.com/watch?v=J8nblo6BawU is some great watching on how "Plain text isn't that simple"


UTF-8 is a text format with no BOM. Just like ASCII doesn't support a BOM. The BOM is a UTF-16 or UTF-32 thing, so "UTF-8 with BOM" is a binary file that happens to contain some UTF-8 strings as well. Since it's not a text file, it makes sense that utilities expecting text files don't handle it.

Eh? A utf8 file starting with ZERO WIDTH NO-BREAK SPACE is not a text file? How do you figure that?

If it starts with 0xFE 0xFF, but is otherwise UTF-8 instead of UTF-16, it's a binary file. If it starts with 0xEF 0xBB 0xBF, it's a text file with a ZERO WIDTH NO-BREAK SPACE at the start.
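A quick Python sketch of that distinction, using only the built-in UTF-8 decoder (`is_valid_utf8` is a name invented for this example): the bytes 0xEF 0xBB 0xBF decode to U+FEFF, while 0xFE can never appear in valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # 0xEF 0xBB 0xBF is the UTF-8 encoding of U+FEFF, so this is valid UTF-8.
    with_utf8_bom = b"\xef\xbb\xbf#!/bin/bash\n"
    # 0xFE 0xFF is the UTF-16 big-endian BOM; 0xFE is not a legal UTF-8 byte.
    with_utf16_bom = b"\xfe\xff#!/bin/bash\n"

    print(is_valid_utf8(with_utf8_bom))
    print(is_valid_utf8(with_utf16_bom))
    print(with_utf8_bom.decode("utf-8")[0] == "\ufeff")
```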

> If it starts with 0xFE 0xFF, but is otherwise UTF-8 instead of UTF-16, it's a binary file

Sure, but who does this? All the Microsoft tooling writes 0xEF 0xBB 0xBF if you output utf8 with a BOM.



