The main problem is using text as a common format between different applications.
First: text is not well defined. Is it ASCII? Is it UTF-8? Some programs can even spew UTF-32 with the right locale configured; it's a mess.
Second: encoding and decoding of objects to text is not defined at all. The problems with filenames are just one example. Using newline as a separator is a natural thing that is easy to implement, yet it is wrong.
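To make the filename problem concrete, here's a minimal sketch (POSIX shell, run in an empty scratch directory; GNU `ls` prints raw names when output is piped):

```sh
# Filenames may contain any byte except NUL and "/", including newlines.
# Create one file whose name embeds a newline:
touch 'report
2024.txt'

# Any line-oriented consumer now sees two names where there is only one:
ls | while IFS= read -r name; do printf 'got: %s\n' "$name"; done
# got: report
# got: 2024.txt
```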
In my opinion two things should be done:
1. Standardise on UTF-8. No other encodings allowed.
2. Standardise on JSON. It is good enough to serve as a universal exchange format, and tools like `jq` have existed for some time now.
So any utility must read and write JSON objects when some standard environment variable is set. And shells can be developed with better syntax for dealing with JSON. This way you could write something like
`ps aux | while read row; do echo ${row.user} ${row.pid}; done`
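No shell understands that `${row.user}` syntax today, but the idea can be approximated with `jq`, assuming the producer emitted JSON. A sketch; since no JSON-emitting `ps` exists, its output is simulated here with a here-document:

```sh
# Simulated output of a hypothetical ps that prints one JSON object
# per process; jq's string interpolation picks out the fields:
cat <<'EOF' | jq -r '"\(.user) \(.pid)"'
{"user": "root", "pid": 1}
{"user": "alice", "pid": 4242}
EOF
# root 1
# alice 4242
```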
> It is good enough to serve as a universal exchange format, and tools like `jq` have existed for some time now.
Please don't use that underdefined joke of a spec. Define "PosixJson" and use that instead. Right now it's not even clear what the result of parsing `{"a": 1234678901234567890}` is. Is this a parse error? A bigint? A float/double? Quiet wraparound? Something else? I've seen all of these behaviors in real-world JSON implementations across different languages.
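This is easy to demonstrate from the shell with two parsers you probably already have installed (a sketch assuming `python3` and `node` are on the PATH; the exact rounded digits depend on the engine's float formatting):

```sh
J='{"a": 1234678901234567890}'

# Python parses JSON numbers without a fraction as arbitrary-precision
# ints, so the value survives exactly:
python3 -c 'import json, sys; print(json.loads(sys.argv[1])["a"])' "$J"
# 1234678901234567890

# JavaScript parses every JSON number as an IEEE-754 double, so the
# low digits are silently rounded away:
node -e 'console.log(JSON.parse(process.argv[1]).a)' "$J"
# 1234678901234568000
```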
> A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character.
So, if you have some non-printable characters like BEL/␇/ASCII 0x07, that's still a text file.
(and I believe which bytes count as a valid character depends on your `LC_CTYPE`).
But the moment you have a line longer than {LINE_MAX} bytes (a limit that varies between POSIX environments), suddenly your text file is a binary file.
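You can ask your environment for the limit; POSIX only guarantees it is at least 2048. For example:

```sh
getconf LINE_MAX
# 2048 (a common value; it varies by implementation)

# A file with a single 1 MiB line exceeds any realistic LINE_MAX, so by
# the definition above it stops being a "text file":
printf '%1048575s\n' '' > longline.txt
wc -L longline.txt   # wc -L is a GNU extension: longest line length
```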
Kind of a weird definition indeed. One edge case: the definition states the file must contain characters, so presumably zero-length files are out. But then how could you have zero lines?
Yes, obviously. But the POSIX specification quoted above says a "text file" contains characters, which an empty file by definition does not. So an empty file cannot be a text file if you read that specification strictly, and therefore you cannot have zero lines in a text file. As soon as you have a single character there is at least one line, and the number of lines can only stay the same or grow from there.
The definition should read "one or more lines" instead, or (probably better) specify that a text file contains "zero or more characters".
What cursed madness have you hit that spits out UTF-32 under normal conditions?! That can only be a bug - UTF-32/UCS-4 never saw external use, and has only ever been used for in-memory fixed-width character representation, e.g. runes in Go.
You never have to worry about whether you're dealing with ASCII vs. UTF-8 (ASCII is a strict subset of UTF-8), but rather whether you're dealing with UTF-8 vs. ISO-8859-1, or worse, Shift JIS or similar.
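The distinction shows up at the byte level: every ASCII byte means the same thing in UTF-8, while legacy single-byte encodings reuse the high range. A small sketch with `iconv` (the exact error wording varies by implementation):

```sh
# 0xE9 is "é" in ISO-8859-1...
printf '\351\n' | iconv -f ISO-8859-1 -t UTF-8
# é

# ...but a lone 0xE9 byte is not valid UTF-8 at all:
printf '\351\n' | iconv -f UTF-8 -t UTF-8
# iconv: illegal input sequence at position 0
```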
I think a lot of tools should support JSON as well as plain text. Probably the latter by default, and the former with a `-o json` or similar option. I'm fine with `wc` giving me `5`; I'd prefer that to `{ "characters": 5 }`.
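For illustration only, here's what that could look like; `wc` has no `-o` option, so both the flag and the output shape below are made up:

```sh
printf 'hello' > notes.txt

# Today: terse, human-first output.
wc -c < notes.txt
# 5

# Hypothetical structured mode (not a real wc flag):
wc -c -o json < notes.txt
# {"characters": 5}
```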
There are exchange formats that are well-defined enough to be useful to machines while also being readable enough to be traversed by human eyes. There's no reason to do everything ad hoc; you don't gain much by it. You also control the shell itself - there's no reason you can't display object representations in a pretty way.
JSON itself is bad for a streaming interface, and streaming is the norm for CLI applications. You can't easily consume a JSON array without first reading it in its entirety. JSONL (one JSON object per line) would be a better fit.
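The difference is easy to see with `jq`, which can both flatten an array into JSONL (`jq -c '.[]'`) and consume the result line by line:

```sh
# An array must be read through the closing "]" before it parses;
# jq -c '.[]' turns it into one compact object per line:
printf '[{"pid": 1}, {"pid": 2}]' | jq -c '.[]'
# {"pid":1}
# {"pid":2}

# JSONL lets a consumer act on each record as it arrives:
printf '{"pid":1}\n{"pid":2}\n' | while IFS= read -r line; do
  printf '%s\n' "$line" | jq .pid
done
# 1
# 2
```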
But then, how well would it work for ad-hoc usage, which is probably one of the biggest uses of shells?