I learned perl in college in the 5.6 days, and did a lot of text processing with it for a time. At some point, I bit the bullet and learned awk, and I've mostly abandoned perl as a result.
Why? awk is small enough that it fits in my head, or at least the bits I need every couple of months do. And if I forget, it only takes 20 minutes to put them back in. perl, by contrast, is far too large to fit in my head and comes with an ecosystem that is much larger still.
I know I can use perl for what I use awk for, and when I've raised this point before, people have been quick to explain how to process input by lines, and conditionally do something. For basic stuff, the fact that that's basically all awk does[0] means there's so many fewer ways to do it wrong. I can't say the same of perl, even when I was more familiar with it.
caveat: awk, nawk, mawk, and gawk don't necessarily share the same set of corner cases, and you may not get error messages that make sense to you when you bump into one with an unfamiliar awk.
[0] I know, not really, and especially for gawk. I've written awk scripts a couple hundred lines in the past. It's true enough for the 20 minute version though.
One particularly mnemonic collection of switches is 'plane':
perl -plane 'my $script'
which iterates over all files given on the command-line (or stdin) and
+ (p)rints every processed line back out
+ deals with (l)ine endings, in and out
+ (a)utosplits every line into @F
I am aware that -n and -p are mutually exclusive, but as -p overrides -n, it seems simpler to just keep 'plane' in mind and remove the 'p' if necessary.
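For example, a quick sketch (the file name and field numbers are made up): the first line uppercases the second whitespace-separated field and prints every line back out (reassembling from @F collapses runs of whitespace), and the second drops the 'p' and prints only selected lines:

perl -plane '$F[1] = uc $F[1]; $_ = "@F"' data.txt
perl -lane 'print $F[1] if $F[0] eq "ERROR"' data.txt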
I’m a bit surprised by this. It surely depends on the use case. I tend to use awk for the same reasons as the parent (smaller to write and remember) and that the pattern-action structure works really nicely once you get used to it.
That depends on what you are doing. Which test did you run?
Around 10 years ago I rewrote markdown.pl in awk and it was almost 20 times faster. The speedup came from both a much faster startup time and a much faster (albeit simpler) regexp implementation.
I'm tending towards Python for anything more than a single line shell script, because I find it easier to keep Python scripts tidy and well-structured. Not saying this is impossible, but I find the bash syntax for nontrivial operations hard to learn. This probably also applies if you replace bash with awk and python with perl.
What are you using Perl/awk for that it's needed so often?
I've been a software engineer since 2005 and worked my way up to being a VP of Engineering currently, and I've never had to use either Perl or awk (or similar). I often read about these tools on Hacker News and find it quite mystifying, as I've managed to write Java, Scala, C#, SQL, and so on for 15 years and happily never needed them.
Is this a certain kind of engineering job that requires searching through text files so often and requiring specialized tools? I've managed my whole career with ctrl-f, and highlight-all matches.
There are hundreds of use-cases. It is not important what the use-cases are. There are two types of developers: those who use hacking to solve all kinds of tasks that would otherwise need hundreds of manual steps, and those who never think of automating smaller steps. The second kind of dev can be very good software engineers, but they prefer IDEs instead of tweaking Vim or Emacs. The first kind of dev will look for possibilities to automate repeated steps, opportunities for tweaking, transforming, hacking. For the fun of it.
I think it depends a lot on the sort of software one works with. A main use of awk and other unix tools for me is ad-hoc data munging, looking through production logs, taking bits of data from different sources and comparing them. If you make e.g. GUI applications and sell them, or work as a contractor developing in-house applications for clients, you probably don't see a lot of production logs like that or process them that way. Any data that you expect to process ought to go into a database, and then you can use sql (where joins work much better than the unix join command and you can use the actual structure of the data instead of trying to tease it out with the smallest, simplest code you can think of). You might be using a debugger or backtraces or reproduction in test to deal with/investigate production issues rather than starting with logs. On the other hand, many other companies will have big production systems that run on lots of different virtual machines and produce lots of logs and have issues that are hard to reproduce (or maybe the logs are easier), and have lots of data spread about random places, such that it may be easier to do the hacky thing for one-off cases or to prototype or whatever rather than doing things more thoroughly or properly. And this setup will also lead to tools that are designed to fit into the rest of the unix ETL stack by the way they output or input data.
On the second point, I think that even as a die-hard emacs user I wouldn't recommend that anyone try to use emacs as an IDE for something like java (or maybe c++). I'd probably try to use emacs myself, but I think my java-editing experience would be worse than for those people who do it with an IDE. More realistically, I'd try to avoid writing java if at all possible.
In all that time, have you strictly been writing code, on computers that came set up out of the box, that runs on computers managed by other people? This is not an insult, I’m aware such things exist, but I seem to travel in different circles. It may come from a certain era of being into computers as a hobby for a long time before I could call myself a software engineer, and having to do what we now call DevOps necessarily by working for companies that didn’t have the resources to hire a fleet of system administrators. As a result I may live in a bubble where the friends and colleagues I associate with are as comfortable with a command line as they are with an IDE, and frequently jump between the two.
I mention DevOps because it is hard to think of a situation where you’re just coding Java and C# all day that truly requires awk-or-similar. But just the other day I had a situation where I copied the contents of a CD-ROM to a web server — drag and drop on my Mac — and ended up with the files all in upper case. The program I was trying to run was failing because it requested everything with lower-case URLs. There were hundreds and hundreds of files in a bunch of directories, so I fixed it with Perl in about 30 seconds.
I’m curious if that’s the sort of problem you never encounter, or if it is, what you reach for.
I'm sure use cases vary to some degree, but awk clicked for me when I had some data that wasn't formatted well. I didn't use it in a script, I just worked in the shell and kept iterating until it looked good, then redirected it into a text file. I use it for quick one-offs like this all the time now.
I've also used awk to get button presses from an input stream on a MIDI controller. For me, the up-front cost of learning a few awk commands quickly paid for itself.
This. It's often a good enough tool for solving problems that technically require a specialized tool. Except learning the specialized tool takes an order of magnitude more time than iterating on awk to get a good-enough solution.
I'm not proud to admit it, but I've used awk in a couple places where I should have properly used `expect' instead. Except, I don't know expect well and haven't gotten to the point where it was worth the investment in learning it.
Are you working in a Linux development environment? Targeting Linux systems? You're not going to come across much of a use for awk if you're in the Windows world. I'm making an assumption based on C#.
I've used perl and awk extensively for all sorts of things. Parsing embedded device logs to generate reports was a big one for me; that's where I really learned both languages. And they are superb tools for that task. We also used awk quite a bit for configuration parsing, as a sort of intermediary between different processes and scripts on those devices.
As I moved into web development, I've found myself using both much less. I haven't touched perl more than once in the last 6 years -- I was given a script to maintain recently but it only needed an hour or two of my time to add a small feature to. That perl script is part of a legacy build system. I still use awk every couple of months but it's just part of some pipeline on the command line.
Using any kind of management position as justification for anything tech-related won't have the effect you're probably aiming for. Dev going management is always downshifting.
So I just wrote one for personal use. I was looking for a duplicate file/photo finder and read some reviews about losing data, so I asked myself why I would trust a third party with my data and wrote a dupe finder. It uses a hash (md5) to make a match (not perfect, but it works), then sorts and prints to show me the dupes. The first version was 45 lines of code, although I could have reduced that if I tried. You don't strictly need awk, but I used it to prettify my output and add some filters. The current code, with lots more bells and whistles, is less than 150 lines (including whitespace) of bash and awk. And it works for me and I trust it.
I could set up a database, write multiple layers of code ... but really?
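For the curious, the core of such a dupe finder fits in one pipeline; this is only a sketch (the path is hypothetical, and hash collisions and odd filenames are ignored):

find ~/Photos -type f -exec md5sum {} + | sort |
  awk 'seen[$1]++ { print "duplicate:", $0 }'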
I don't, and that's half the problem :-) If I did, I'd probably remember enough perl that I'd have never bit the bullet and learned awk!
To answer the serious question, it's often enough for a first order solution to any problem where you have line and field separated text. Pulling instances of "something weird" out of log files, for instance, is a great use for awk, especially if the fingerprint of "something weird" is scattered through fields in a line in a way that makes grep cumbersome. Or if it spans a couple of lines, especially if you're dealing with a log from a multi-threaded app where there might be irrelevant lines interleaved with the ones you care about.
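For example (the field positions and log name are invented), something like this pulls out lines whose "weird" fingerprint is spread across several fields:

awk '$3 == "ERROR" && $7 ~ /timeout/ { print $1, $2, $NF }' app.log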
I haven't written much production-grade awk, but I've often used it as a tool to understand a problem well enough to write a production-grade solution or fix bugs in application code.
The author offers one answer to this in his opening text:
> I say it's useful on servers because log files, dump files, or whatever text format servers end up dumping to disk will tend to grow large, and you'll have many of them per server. If you ever get into the situation where you have to analyze gigabytes of files from 50 different servers without tools like Splunk or its equivalents, it would feel fairly bad to have to download all these files locally to then drive some forensics on them.
>
> This personally happens to me when some Erlang nodes tend to die and leave a crash dump of 700MB to 4GB behind, or on smaller individual servers (say a VPS) where I need to quickly go through logs, looking for a common pattern.
You ever deal with very large sets of large text files? You really CTRL-f through multi-GB files? Have you never needed to transform data between formats or yank out subsets of data for processing? You may have just been working on business systems that did not deal with significant data volumes, or with data you did not control (coming from some other organization, for example).
People (me anyway) used perl back in the dark ages of the 90s for the same sorts of things people use python for now. It was really the only option for open source general purpose interpreters for quite a while. People still use it in biotech I think, or were 5 years ago; the text processing capabilities are useful for genomic data.
Awk/sed and the ETL stack that comes with every Unix-like OS (aka od, tr, cut, sort) and all that are superb data wrangling tools. Log file parsing, data cleaning, even actual data science at scale can be done with these tools. They're extremely efficient, as they were designed in an era when pretty much all interesting data was comparable to or much larger than memory size. As such, you can do a lot of stuff with them that most people don't imagine is even possible. FWIW, for high-end data scientists I don't consider knowledge of these tools to be optional at all. Anyone who hasn't used them in their career hasn't worked on serious problems, and is probably the kind of educated idiot who will suggest you do a job in a giant Hadoop cluster that you could easily do on one machine.
Perl 6 is when they then took LSD and, at a hang-up, went to the clinic and came back as two people. One said: never again; the other: bring it on.
My absolute favorite example of what's possible with awk is this[0] calculator from Ward Cunningham about splitting expenses on a ski trip. It's a really beautiful little piece of code well-adapted to this problem.
This is the sort of thing we industry professionals need to study in more depth. It doesn't have the most flexible interface, but it's human-readable, and the implementation is simple, fast, and about as obviously correct as possible.
I glanced through the input and output. When I got to the code, I was startled by how short it was. This page is worthy of its own post on Hacker News. I am now convinced that Ward Cunningham is a genius. I was already pretty sure after reading how he invented the wiki.
This reminds me of a story of a guy who wrote a whole company internal debit system like this (coffee, meals out, etc) and it turned into a currency. I feel like I either saw it here, or a similar forum. Anyone have a link? My searches have not found it...
It's also worth mentioning that local variables can be simulated using additional formal parameters. In AWK, any missing parameter in a function call starts out uninitialized (the empty string, which also compares equal to zero).
Let's say we have a function CharCount which takes a character and a line of text and returns the number of occurrences of that character:
function CharCount(ch, line,
n)
{
...
}
The line break in the parameter list is an AWK convention and indicates that n is a "local variable."
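To make the convention concrete, here is a rough sketch of a complete version (the body is purely illustrative, not from the parent):

function CharCount(ch, line,
                   n)
{
    while (index(line, ch) > 0) {
        n++
        line = substr(line, index(line, ch) + 1)
    }
    return n
}
BEGIN { print CharCount("a", "banana") }    # prints 3; callers never pass n, so it acts like a local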
I love using awk. It was fairly easy to pick up, and it slides into my command line workflow pretty well.
That said, there's an amazing amount you can do with it, if you really try. Someone once joked that I should try building an IRC bot in AWK, so I did:
https://github.com/Marcus316/rufus
There's no practical reason for it, but it was fun to play with the idea.
Very nice summary of the language. However, keep in mind that awk is not the fastest language around. A pro-awk thread might not be the best place to tell this story, but it is fresh and true:
Last weekend I was playing around with some data. At first, I thought 'let's just write a line of awk and be done with it' and so I did. The execution took 20 seconds (about 17 million lines) and everything was fine.
Later that day, I came across another task which seemed too complex for an awk one-liner so I took two lines of R and was surprised when R was done within 5 seconds on the same data set.
I was happy because I found a faster tool than the one I had, but the lesson is that just because you use a proven tool like awk doesn't mean there aren't any better tools. Find out what works best for you.
awk is a language with multiple implementations, so we can only talk about the performance of a particular awk implementation. This is especially true for awk because mawk (based on a VM rather than being a traditional interpreter) is so much faster than other awks, if you can live with its limitations [1]. If you're using gawk, setting LANG=C also helps performance because no utf-8 handling needs to be done. Speaking of which, I've noticed gawk on recent Ubuntus (19.10) appears broken and/or has regexp size limits breaking awk code with large regexps (but I haven't checked thoroughly yet).
Was it a faster tool? You had a need, you tossed awk at it and accomplished your goal. Did the 20 seconds vs. 5 seconds matter? The time you spent deciding what to do and how to do it took more time. For a one-off, grab some data type of task, the tool that you know well will almost always be the fastest because you get to the end result the quickest.
I agree with the comment below. If you haven't already, try mawk instead of awk. It is often many times faster than awk (and other solutions).
What kind of task was it? I'd imagine for loading a csv-like file and doing somewhat heavy calculations, R would win, but when I think of AWK problems I think of text manipulation, and when I think of manipulating text I do not think of R.
Disclaimer: I don’t actually write any AWK, but learning it is on my bucket list.
AFAIK, calculating the difference to its previous line for every line should be faster for data that is not sorted. I didn't try to write that one in R though.
I am sure there are ways to make both things faster; I was just surprised, as I didn't expect R to be faster with those two naive implementations.
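In awk, that per-line difference to the previous line is roughly the following (assuming the value of interest sits in the first column of a hypothetical data.txt):

awk 'NR > 1 { print $1 - prev } { prev = $1 }' data.txt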
I have never found anything faster than mawk. GNU awk (gawk) is orders of magnitude slower. For a proper comparison you would have to note which implementation of awk you are using.
Whenever somebody says "Oh, I don't know how to use awk", this is the link I send them. It's easily the most useful tutorial for awk on the 'net. There's a lot more you can do with awk, but this site shows you the top percentage in a very short amount of time.
It's crazy how this reappeared on HN only hours after I went searching for awk on Hacker News. Fred Hebert is so well known in the Erlang and Elixir communities.
I have just coined this idea the "gateway drug" approach to learning new tools. We should strive to find those introductions that are small enough that someone can digest without a massive upfront time investment to get them past the front door :)
I've heard many good things about that book, which is why I bought it several months ago. From the bits I have read, I like it. But I have not yet finished it, partly because it is not pamphlet sized. It is over 300 pages. But I look forward to the chapter on Filters.
Very cool, and I'm grateful to see Awk presented in such a friendly way.
However, at some point, it makes more sense to write your Awk script in something like Python, and my intuition says that that time is shortly after starting a basic Awk script. Using a real programming language is almost always the way to go with code that will become more complex over time (almost all code) and code you have to share with a team (learning curve).
Items can be put on a single line without ambiguity using semicolons:
pattern ; { action } ; pattern { action }
A pattern can have multiple patterns separated by a comma. That syntax admits optional line separation after the comma separators:
pattern,
pattern,
pattern {
action
}
The action fires by a match for any of the patterns.
The POSIX standard Awk expression grammar has no comma operator, on the other hand; the comma exists only for separating patterns, function arguments/parameters, and items in the print statement syntax.
Plus, JNIL (just now I learned). The in operator allows comma-separated expressions for testing "multi-dimensional" array membership. Here is a hello, world:
$ awk 'BEGIN { a[1,2,3] = 4 ; print (1,2,3) in a }'
1
(Note that multi-dimensional arrays in Awk are simulated; a string index is generated for that 1,2,3).
One trick I use with awk, especially throwaway ones, is to use grep to subset the data before feeding it into awk. Often I'm using awk to poke at some data to diagnose a problem, not writing a script to be run often or stuck in cron.
So to use an example from this nice short article: I might do grep GET log-file | awk blah-blah. Then awk doesn’t need to consider the lines I don’t care about. This is especially useful when iteratively writing the awk script.
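Concretely that might look like the following (the field numbers assume something access-log-shaped, so treat it as a sketch): show the paths of GET requests that came back as server errors.

grep GET log-file | awk '$9 >= 500 { print $7 }'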
Oh yeah, sure! I should have been clear that when dredging a really large log file or such it can be faster to run awk on a much smaller subset when your awk program has a lot of productions. I find it easier to think "OK, just run awk on these lines that matter."
There are two use-cases I’ve run into where awk really shines, and is hard to replace:
1. Writing scripts for environments that only have Busybox. Technically you can write scripts in ash, but I don’t recommend it for anything beyond a couple lines. It’s missing a lot of the features from Bash that make scripting easier, and it’s easy to get mixed up if you’re used to Bash and write things that don’t work. Awk is the best scripting language available, even if you’re doing things that don’t exactly match what it was designed to do.
2. Snippets that are meant to be copy+pasted from documentation or how-to articles. In that case, it’s often not easy to distribute a separate script file, so a CLI “one-liner” is preferred. You also can’t count on Perl, Python, etc. being available on the user’s system, but awk is pretty universal.
For most other cases, I tend to create a new .py file and write a quick Python script. Even if it’s a little more overhead, it helps keep my Python skills sharp, and often it turns out that what I actually want is a little more complicated than my initial idea anyway.
Great post. I didn't know Fred had a blog. Fred's writing style never ceases to amaze me. His book on property based testing in Elixir was a better introduction for me to Python's Hypothesis than any of the other tutorials I found online on the topic because it made me think about discovering properties of my code. Would recommend it to others.
If the file paths have spaces in them, then you have to wrap the name in double quotes. I have found it challenging to output those in mawk -- at least not easily, anyway. In this case, I have a Windows version of tr.exe.
I use awk when I need something a bit more powerful than "grep"/"cut", but don't want to pull out the big guns with Perl.
For example, recently I needed to print out certain fields of an output, but only for a given subset. So I used Awk to create a simple state machine (enable when I see the start of the subset, disable at the end), and print the fields of interest.
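A minimal sketch of that shape (the markers, fields, and file name here are made up):

awk '/BEGIN_SECTION/ { on = 1; next }
     /END_SECTION/   { on = 0 }
     on              { print $1, $3 }' report.txt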
The grep/cut replacement is exactly what I use awk the most for. I use it for other things too but this stands out. awk just fixes the problem when you need to have more than one field from the input and maybe a different string between them. I guess this can be done with cut too, but for me it's simpler this way.
Also I use awk often in scripts together with fzf to interactively switch contexts e.g. in cloud provider CLIs.
Awk '{ print $2 }' does LWSP gobbling; cut -d' ' -f2 doesn't, and this difference alone makes awk useful to me on a daily basis.
Awk has hashes (associative arrays) like perl, which are very efficient. I use an awk expression to print unique entries as they come in, counting through the hash and printing on first insert, instead of uniq, which only prints at the end.
Counting unique entries over 300,000,000 IPs in awk was as fast as perl and python, with a smaller memory footprint.
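The classic idiom for that (assuming the IP is the first field of a hypothetical access.log) is the first line below, which prints each line the first time its IP appears; the second variant just reports the unique count at the end:

awk '!seen[$1]++' access.log
awk '!seen[$1]++ { n++ } END { print n }' access.log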
LWSP=Linear White Space
"Linear white space is: any number of spaces or horizontal tabs, and also newline (CRLF) if it is followed by at least one space or horizontal tab." [1]
I don't know what LWSP gobbling is, though. Google didn't help.
Yes, whereas cut considers '  ' (two spaces) as delimiting a blank column when splitting on the -d' ' space. It's like ,, in CSV being a blank column.
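A quick way to see the difference:

$ printf 'a  b\n' | awk '{ print $2 }'
b
$ printf 'a  b\n' | cut -d' ' -f2

(The blank output from cut is the empty field between the two spaces, while awk gobbles the whole run of spaces and gives you b.)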
This made me very happy, since it goes to the 'you can either use awk or explain awk but not both' problem, which is definitional in monads in Haskell. I thought LWSP was a thing. I forgot not everything is a thing. And 'gobbling' is maybe very jargon-ish. Python sys.stdin.rstrip().split() comes to mind.
If argument expansion is required, consider using read. That way the argument is given a name. Use "read a b c < x" instead of "b=$(cat x | awk '{ print $2 }')".
Just remember to pipe to read with care, as the right hand part of a pipe is a subshell in which variables are local. So "read a b c < <(echo 1 2 3)" works, "echo 1 2 3 | read a b c" doesn't.
Another way to expand arguments is to simply define a shell function and use $2. Something like "process_line() { echo $2; }".
When you have to reach for something like awk, your script would probably improve by being mostly awk. In which case most people are probably looking at perl or python anyway. Even if awk is a nice language, the arrival of perl mostly killed it. It is not wrong to say perl was the next version of awk.
If you just use awk for `{ print $2 }` then I would still prefer `tr -s ' ' | cut -d ' ' -f2`, since both are part of coreutils and with `awk` you would add an additional dependency to your script.
IMO, it would be great if 'cut' could accept more than just a single character for delimiters.
e.g. a regex or even a shell-style pattern match would be very useful. `cut -d ' +' -f2` would get rid of the need for 'tr' in your example.
Even just allowing >1 char delimiters would be helpful. A text list like 'foo, bar, foobar' is too much for cut, but it would be fine if cut accepted parameters like `cut -d ', '`
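For what it's worth, awk's -F already takes an extended regex, so both the multi-character and the 'one or more spaces' cases are easy there; a quick sketch:

$ echo 'foo, bar, foobar' | awk -F', ' '{ print $2 }'
bar
$ echo 'foo  bar baz' | awk -F' +' '{ print $2 }'
bar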
Why would being part of coreutils matter? Awk is part of POSIX, just like tr and cut. Even in very constrained environments you can count on it via busybox.
A perpetual argument is use of too many pipe separated distinct commands. The argument suggests it's lazy to use sed | awk | grep type pipe runs because in all probability sed or awk alone could have done it and you incurred two excess fork/exec() and therefore consumed kernel and userspace beyond your need. The usual rejoinder is "get nicked"
A tl;dr for the following tl;dr: it's great for quick-and-dirty processing of logs.
The tl;dr is that AWK is a simple language that lets you use a line-oriented, event-driven programming model to process text. This abstraction is simple enough to be readable and maintainable, but powerful enough to parse and process text with line-oriented structure.
The example linked elsewhere is worth studying, because it truly is one of the prettiest pieces of programming I've seen in a long while:
If you don't know anything about AWK, here's all you need to know to grok this:
1. An AWK program is a list of [event] { code } pairs. All pairs are run in sequence on each line of input; if the "event" evaluates to true or is a matching regex, the matching code runs.
2. { code } on its own runs unconditionally.
3. The input is automatically whitespace delimited into fields; $1 refers to field 1, $2 to field 2, etc
4. NF is a special value that yields the number of fields
5. Assigning to fields replaces the text in those fields.
6. ($1+0) != 0 implicitly converts $1 to a number; if the field isn't numeric, the value is 0.
There is a lot of implicit loose typing in the language, and usage of undefined variables is idiomatic. Functions aren't easy to use, either. So it's not well-suited for programming in the large, or even the medium. But for the scale of programming seen in that link, it's truly a wonderful and simple power tool.
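Tying points 1, 2, and 6 together, a tiny example in the same spirit (expenses.txt is hypothetical): sum every line whose first field is a number and ignore everything else.

awk '($1+0) != 0 { total += $1 } END { print total }' expenses.txt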
I use awk to write csv-parsers for my personal finance setup. They take transaction logs and convert them into a ledger format (https://www.ledger-cli.org/). There's some pretty trivial if/else branches based on regexp's, which awk is pretty good at expressing.
I'd normally write this sort of thing in python. The awk program is usually smaller than a comparable python program, but for me the main selling point is that awk programs are more "UNIX-pipe-native" than python programs are.
At some point, I'll probably rewrite these programs in a more "serious" programming language, once the file sizes get too big or something, but for now they're working great and are easy to extend.
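As an illustration only (the column layout of date,amount,description, the categories, and the file name are all made up, and it naively assumes no quoted commas), such a parser can be as small as:

awk -F, '
  { cat = "Expenses:Uncategorized" }                  # default category for each transaction
  /GROCERY|SUPERMARKET/ { cat = "Expenses:Food" }
  /COFFEE/              { cat = "Expenses:Coffee" }
  { printf "%s * %s\n    %s    %s\n    Assets:Checking\n\n", $1, $3, cat, $2 }
' transactions.csv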
I've used it for this too, but things tend to go south pretty quick if you have escaped or quoted commas. That, unfortunately, is where I usually break down and pull out python.
The risk of that seems to be if CSV allows embedded escaped quotes in a quoted string. Does it? I don't know. And CSV is pretty loosely defined. For most people it's probably "whatever Excel emits or ingests".
And I think that's why we're on the same page about pulling out python and using a module where somebody has explored what the corner cases are and dealt with them for us already.
Yea CSV is famously bespoke, I don’t know if it’s ever been standardized?
In my experience, each implementation is more or less unique. Even if you allow escaping, different formats permit different methods of escaping and you might have to account for each! shudders
It’s much easier to support that in python than it is in awk with regexp. The awk route will eventually make you a regexp wizard, though, which confers it’s own benefits. :)
Awk automatically operates on every line of a text file, so you can easily use it as part of a pipeline in Linux. I'll usually use grep, cut, sort, and awk all together with one line of code and no need to open an editor. Python takes a lot more code to do these basic things. However, Python is much better once the complexity grows past a certain point.
depends upon the use case; for relatively simple column filtering, substitution, etc., an awk script would be briefer than the python equivalent and most likely faster as well
if it becomes lengthy (and again depends on features required), then Python is likely to have inbuilt/3rd-party libraries to make it easier to write and maintain
if it is a question of constructing command line one-liners and using it as part of other cli tools, then awk wins easily
for example:
awk -F'\\W+' -v OFS=, '{print $NF, $2}' input.txt
prints last column and second column, where non-word characters form the field separator and comma is used as output field separator
Processing text files line by line is trivial in awk. Maybe it's easy in Python too but at least in awk you don't have to worry about the 2 vs 3 silliness.
Plus I think semantic indentation is not sane language design.
Plus you can pipe it to another command with ease.
Nice intro. Awk is a useful tool when you want to do simple line-by-line processing.
Note: The article says that awk patterns can't capture groups. The standard doesn't provide that functionality, but if you use the widely-available gawk implementation, gawk does have that capability (use "match").
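For instance, gawk's three-argument match() fills an array with the capture groups; a tiny made-up example:

$ echo 'user=alice id=42' | gawk '{ match($0, /id=([0-9]+)/, m); print m[1] }'
42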
Last year, I attended a tech event where Rasmus Lerdorf (the creator of PHP) was a guest speaker. When asked about his favorite programming language, he said: AWK. He didn't elaborate why but looking at this article, I think there is something to it.
I recommend this article to people all the time and often come back to it myself. I don't use awk as much these days but when I worked in bioinformatics I was fluent and loved it.
Every time I set out to use awk for some one time task I struggle to express what I need and end up saying fuck it and fire up vim and do exactly what I want with some quick macros.
I usually persevere and get it done in awk but then find the next time I need awk that I don't remember any of it and can barely understand my own examples.
Nice, I finally took the time to read the man pages for awk, and whipped up a script to count the number of errors that occurred per day in a postgres log file.
cat logfile | awk '/ERROR:/ {counts[$1] = counts[$1] + 1}; END { for (day in counts) print day " : " counts[day]}' | sort
I just needed to know how awk programs are structured, the rest is just simple programming!
EDIT: I'm not sure if it's actually correct however...
> cat logfile | awk '/ERROR:/ {counts[$1] = counts[$1] + 1}; END { for (day in counts) print day " : " counts[day]}' | sort
Great first program! A slightly less verbose version could be
> awk '/ERROR:/ {counts[$1]++}END{...}' logfile
there are also ways of sorting the output within (g)awk (asort & asorti), but sorting externally as you have done is more flexible and engages another core, which can be faster on large input
There's really nothing useless about that use of cat: it makes the pipeline compose better from left to right. It's not like you have to pay 25 cents for each process you spawn.
It's not detrimental to performance since an empty cat is a no-op in a pipeline. You can have any number of them. But commands should be written for humans to understand, and inserting no-ops is a distraction to the reader.
In the trivial example, "grep needle haystack" reads better than "cat haystack | grep needle".
This is a great intro; I have the GNU Awk user's manual bookmarked because there are a lot of features in gawk you will only rarely use but are quite useful.
I've recently modified and published that tutorial as an ebook [1], which can be read from the github repo or downloaded as PDF (currently free). I am also updating the book to include exercises, other minor improvements as well as epub version - expected release by next weekend.
# common lines
comm -12 <(sort file1) <(sort file2)
# lines unique to first file
comm -23 <(sort file1) <(sort file2)
# lines unique to second file
comm -13 <(sort file1) <(sort file2)
regarding readability, it is the same with any new tool or programming language: you'd need to be familiar with its syntax and idioms; someone not familiar with the command line and the sort/uniq commands will find your solution just as alien