Why you shouldn't parse the output of ls (wooledge.org)
31 points by dgellow on July 6, 2014 | 18 comments


This is the big thing that disappointed and frustrated me after I had spent a bunch of time hacking on Lisp Machines and then switched to Unix: in the Unix world, everything but everything is character strings. On the LispM, when you called 'directory', you'd get back a list of pathname objects. All the system interfaces were like that; it was hardly ever necessary to parse anything -- and when you did have to, it would be in s-expression format, so all you'd have to do is call 'read' on it.

In contrast, Unix is a Babel of different syntaxes. Every basic command like 'ls' has its own output syntax; every configuration file is in a different syntax. (Command line parsing isn't standardized either, but that train wreck deserves another conversation.)

In the case of the LispM all this was achieved by running the entire OS and all apps in a single address space; this obviously made passing objects between apps trivial, but at the price of a complete absence of security. Such a design would be a non-starter today. However, what you could do today would be to specify a standard system-wide serialization format, and give all the basic system commands an option to generate it. S-expressions would work great, but if you can't stand them, okay, use JSON. (Don't even think about using XML.)

The result would be, instead of just piping text strings from one app to another, you could, in effect, pipe objects. It's a far more powerful paradigm and would save you all this parsing pain.


> everything but everything is character strings

Actually it's worse: they're byte streams. They don't have to be decodable as any encoding, can contain weird control characters, etc.


Anybody interested can read the UNIX-HATERS Handbook (UHH): http://web.mit.edu/~simsong/www/ugh.pdf


The utility 'find' has a nice parameter you can use if you need to parse its output.

    find ./ -type f -print0
Using the '-print0' option outputs a NUL-terminated list. Since Linux filenames can't contain NUL bytes, you can parse the output reliably.
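For anyone wondering what consuming that output looks like in bash itself, here is a minimal sketch. The temporary directory and the file names are made up for illustration, and it assumes a find that supports -print0 (GNU and BSD find do).

```shell
#!/usr/bin/env bash
# Sketch: read NUL-delimited find output one filename at a time.
# The temp directory and file names are hypothetical test data.
tmpdir=$(mktemp -d)
touch "$tmpdir/plain" "$tmpdir/with space" "$tmpdir/with"$'\n'"newline"

count=0
# IFS= keeps leading/trailing whitespace, -r keeps backslashes literal,
# -d '' makes read stop at each NUL byte instead of at each newline.
while IFS= read -r -d '' path; do
    count=$((count + 1))
    printf 'found: %q\n' "$path"
done < <(find "$tmpdir" -type f -print0)

printf 'files: %s\n' "$count"
rm -rf "$tmpdir"
```

The loop reads from a process substitution rather than a pipe so that 'count' is still set after the loop ends; with a pipe, the loop would run in a subshell and the variable would be lost.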


Indeed, xargs (often combined with 'find') has a -0 option indicating that the input is null-terminated. So you often see:

  find ./ -type f -print0 | xargs -0 ...
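To make the difference visible, here is a small sketch that counts how many arguments xargs actually hands to the command, with and without the NUL delimiter. The directory and file names are invented for the demo, and it assumes the temp path itself contains no whitespace.

```shell
#!/usr/bin/env bash
# Sketch: compare argument counts with and without NUL delimiters.
tmpdir=$(mktemp -d)
touch "$tmpdir/a b" "$tmpdir/c"   # two files, one with a space in its name

# sh -c 'echo "$#"' sh prints how many arguments xargs passed along.
with_nul=$(find "$tmpdir" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

# Without -print0/-0, xargs splits on whitespace, so "a b" becomes two arguments.
naive=$(find "$tmpdir" -type f | xargs sh -c 'echo "$#"' sh)

printf 'with -0: %s arguments, without: %s arguments\n' "$with_nul" "$naive"
rm -rf "$tmpdir"
```

With two files on disk, the NUL-delimited pipeline passes two arguments while the whitespace-split one passes three.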


All of this makes me happy that I use PowerShell, where my output isn't some text I need to carefully parse (avoiding edge cases), but a list of objects, each of which has properties for me to interrogate.


Of course bash is not the only shell, nor is it the only approach for scripting on Linux/Unix. See e.g. Perl, Python, etc.


Linux seems like a very nice operating system. I can't wait until it gets a serious command line interface.


Some of these mistakes are detected by ShellCheck:

http://www.shellcheck.net/


It's good to be aware of these pitfalls, but in practice they often don't arise. If you're parsing log files, or any other system-generated files with sane filenames (no spaces or other odd characters), you won't have an issue. Still, I normally would never attempt to parse 'ls' for this sort of thing. The preferred approach in a shell script is to use the shell's globbing capabilities (as in the example given):

  for f in *; do
      [[ -e $f ]] || continue
      ...
  done
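In case the '-e' test looks redundant: when nothing matches, bash leaves the pattern unexpanded, so the loop body runs once with the literal pattern as the "filename". A rough sketch of that behaviour, using a hypothetical helper called count_entries and a throwaway directory, and assuming default shell options (nullglob and failglob off):

```shell
#!/usr/bin/env bash
# Sketch: why the glob loop needs the existence check.
dir=$(mktemp -d)

# count_entries is a made-up helper for this demo, not a standard command.
count_entries() {
    local n=0 f
    for f in "$1"/*; do
        # In an empty directory the pattern stays unexpanded, so $f is the
        # literal string ".../*", which does not exist; skip it.
        [[ -e $f ]] || continue
        n=$((n + 1))
    done
    printf '%s\n' "$n"
}

empty_count=$(count_entries "$dir")
touch "$dir/two words" "$dir/plain"
full_count=$(count_entries "$dir")

printf 'empty: %s, populated: %s\n' "$empty_count" "$full_count"
rm -rf "$dir"
```

Setting 'shopt -s nullglob' is another way to get the same effect, at the cost of changing glob behaviour for the rest of the script.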


I think if you're trying to use the shell for something other than some basic program launch glue, you are doing it wrong.

In C, readdir returns a perfectly usable struct dirent * with no parsing issues to worry about.

Python also provides a usable Unix layer for automation.


Every so often, I'll find myself frustrated with some bash script. For me, once a shell script gets to that point, it's best to rewrite it in Python. Translating a script into Python is almost fun, and I find the result much more maintainable. BTW, the original submission is great. I've written shell scripts over the years that have made this very mistake! For example doing something like:

   ls pj* | wc -l
This normally returns the number of pj* files, but it will miscount for pathological file names (a name containing a newline is counted more than once), as the submission points out.
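One way to get a count that survives odd names is to let the shell expand the glob into an array and count the elements. This is only a sketch with invented file names in a temporary directory, and it relies on bash arrays and 'nullglob', so it will not work in plain sh.

```shell
#!/usr/bin/env bash
# Sketch: count matching files without parsing ls output.
tmpdir=$(mktemp -d)
touch "$tmpdir/pj1" "$tmpdir/pj"$'\n'"2"   # second name contains a newline

# Counts lines, not files, so the newline in the second name inflates it.
naive=$(ls "$tmpdir"/pj* | wc -l)

# Let the shell expand the glob into an array and count its elements.
shopt -s nullglob            # no match yields an empty array, not the literal pattern
files=("$tmpdir"/pj*)
shopt -u nullglob
count=${#files[@]}

printf 'ls | wc -l says %s, the array holds %s\n' "$naive" "$count"
rm -rf "$tmpdir"
```

Note that the sketch switches nullglob off again unconditionally; a real script that already relies on nullglob would want to restore the previous setting instead.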


Bingo.

Also, if you're creating filenames with newline and escape chars - well, good luck with that.


This does not work well if you also care about hidden files:

    [simula67@hades test_bash]$ touch .hidden
    [simula67@hades test_bash]$ touch not_hidden
    [simula67@hades test_bash]$ find . -type f
    ./not_hidden
    ./.hidden
    [simula67@hades test_bash]$ ls -al
    total 8
    drwxr-xr-x  2 simula67 simula67 4096 Jul  7 00:32 .
    drwx------ 40 simula67 simula67 4096 Jul  7 00:31 ..
    -rw-r--r--  1 simula67 simula67    0 Jul  7 00:33 .hidden
    -rw-r--r--  1 simula67 simula67    0 Jul  7 00:33 not_hidden
    [simula67@hades test_bash]$ for f in *; do echo $f; done
    not_hidden

You have to use shopt -s dotglob:

    [simula67@hades test_bash]$ shopt -s dotglob
    [simula67@hades test_bash]$ for f in *; do echo $f; done
    .hidden
    not_hidden
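Since dotglob stays on for the rest of the shell once it is set, it can be handy to confine it to a subshell. A minimal sketch, reusing the two file names from the session above inside a temporary directory (the directory is an assumption for the demo):

```shell
#!/usr/bin/env bash
# Sketch: enable dotglob only for one loop by running it in a subshell.
tmpdir=$(mktemp -d)
touch "$tmpdir/.hidden" "$tmpdir/not_hidden"

names=$(
    shopt -s dotglob          # only affects this subshell
    for f in "$tmpdir"/*; do
        printf '%s\n' "${f##*/}"   # strip the directory part
    done
)

printf '%s\n' "$names"
rm -rf "$tmpdir"
```

Even with dotglob set, bash never matches '.' and '..' with '*', so the loop only sees real directory entries.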


This is one of those "Required Reading" posts for *nix users. Thank you for this!



This unix.stackexchange post[0] is relevant as well.

[0]: http://unix.stackexchange.com/q/128985/24124


That question/argument is a fine case study in some form of bug in the human mind, the name of which I know not.



