I learned perl in college in the 5.6 days, and did a lot of text processing with it for a time. At some point, I bit the bullet and learned awk, and I've mostly abandoned perl as a result.
Why? awk is small enough that it fits in my head, or at least the bits I need every couple of months do. And if I forget, it only takes 20 minutes to put them back in. perl, by contrast, is far too large to fit in my head and comes with an ecosystem that is much larger still.
I know I can use perl for what I use awk for, and when I've raised this point before, people have been quick to explain how to process input by lines, and conditionally do something. For basic stuff, the fact that that's basically all awk does[0] means there's so many fewer ways to do it wrong. I can't say the same of perl, even when I was more familiar with it.
caveat: awk, nawk, mawk, and gawk don't necessarily share the same set of corner cases, and you may not get error messages that make sense to you when you bump into one with an unfamiliar awk.
[0] I know, not really, and especially for gawk. I've written awk scripts a couple hundred lines in the past. It's true enough for the 20 minute version though.
One particularly mnemonic collection of switches is 'plane':
perl -plane 'my $script'
which iterates over all files given on the command-line (or stdin) and
+ (p)rints every processed line back out
+ deals with (l)ine endings, in and out
+ (a)utosplits every line into @F
I am aware that -n and -p are mutually exclusive, but as -p overrides -n, it seems simpler to just keep 'plane' in mind and remove the 'p' if necessary.
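For example, a quick sketch (the file name and field numbers are made up): the first line uppercases the second whitespace-separated field and prints every line back out (reassembling from @F collapses runs of whitespace), and the second drops the 'p' and prints only selected lines:

perl -plane '$F[1] = uc $F[1]; $_ = "@F"' data.txt
perl -lane 'print $F[1] if $F[0] eq "ERROR"' data.txt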
I’m a bit surprised by this. It surely depends on the use case. I tend to use awk for the same reasons as the parent (smaller to write and remember) and that the pattern-action structure works really nicely once you get used to it.
That depends on what you are doing. Which test did you run?
Around 10 years ago I rewrote markdown.pl in awk and it was almost 20 times faster. The speedup came from both a much faster startup time and a much faster (albeit simpler) regexp implementation.
I'm tending towards Python for anything more than a single line shell script, because I find it easier to keep Python scripts tidy and well-structured. Not saying this is impossible, but I find the bash syntax for nontrivial operations hard to learn. This probably also applies if you replace bash with awk and python with perl.
What are you using Perl/awk for that it's needed so often?
I've been a software engineer since 2005 and worked my way up to being a VP of Engineering currently, and I've never had to use either Perl or awk (or similar). I often read about these tools on Hacker News and find it quite mystifying, as I've managed to write Java, Scala, C#, SQL, and so on for 15 years and happily never needed them.
Is this a certain kind of engineering job that requires searching through text files so often and requiring specialized tools? I've managed my whole career with ctrl-f, and highlight-all matches.
There are hundreds of use-cases. It is not important what the use-cases are. There are two types of developers: those who use hacking to solve all kinds of tasks that would otherwise need hundreds of manual steps, and those who never think of automating smaller steps. The second kind of dev can be very good software engineers, but they prefer IDEs instead of tweaking Vim or Emacs. The first kind of dev will look for possibilities to automate repeated steps, opportunities for tweaking, transforming, hacking. For the fun of it.
I think it depends a lot on the sort of software one works with. A main use of awk and other unix tools for me is ad-hoc data munging, looking through production logs, taking bits of data from different sources and comparing them. If you make e.g. GUI applications and sell them, or work as a contractor developing in-house applications for clients, you probably don't see a lot of production logs like that or process them that way. Any data that you expect to process ought to go into a database, and then you can use sql (where joins work much better than the unix join command and you can use the actual structure of the data instead of trying to tease it out with the smallest, simplest code you can think of). You might be using a debugger or backtraces or reproduction in test to deal with/investigate production issues rather than starting with logs. On the other hand, many other companies will have big production systems that run on lots of different virtual machines and produce lots of logs and have issues that are hard to reproduce (or maybe the logs are easier), and have lots of data spread about random places, such that it may be easier to do the hacky thing for one-off cases or to prototype or whatever rather than doing things more thoroughly or properly. And this setup will also lead to tools that are designed to fit into the rest of the unix ETL stack by the way they output or input data.
On the second point, I think that even as a die-hard emacs user I wouldn't recommend that anyone try to use emacs as an IDE for something like java (or maybe c++). I'd probably try to use emacs myself, but I think my java-editing experience would be worse than for those people who do it with an IDE. More realistically, I'd try to avoid writing java if at all possible.
In all that time, have you strictly been writing code, on computers that came set up out of the box, that runs on computers managed by other people? This is not an insult, I’m aware such things exist, but I seem to travel in different circles. It may come from a certain era of being into computers as a hobby for a long time before I could call myself a software engineer, and having to do what we now call DevOps necessarily by working for companies that didn’t have the resources to hire a fleet of system administrators. As a result I may live in a bubble where the friends and colleagues I associate with are as comfortable with a command line as they are with an IDE, and frequently jump between the two.
I mention DevOps because it is hard to think of a situation where you’re just coding Java and C# all day that truly requires awk-or-similar. But just the other day I had a situation where I copied the contents of a CD-ROM to a web server — drag and drop on my Mac — and ended up with the files all in upper case. The program I was trying to run was failing because it requested everything with lower-case URLs. There were hundreds and hundreds of files in a bunch of directories, so I fixed it with Perl in about 30 seconds.
I’m curious if that’s the sort of problem you never encounter, or if it is, what you reach for.
I'm sure use cases vary to some degree, but awk clicked for me when I had some data that wasn't formatted well. I didn't use it in a script, I just worked in the shell and kept iterating until it looked good, then redirected it into a text file. I use it for quick one-offs like this all the time now.
I've also used awk to get button presses from an input stream on a MIDI controller. For me, the up-front cost of learning a few awk commands quickly paid for itself.
This. It's often a good enough tool for solving problems that technically require a specialized tool. Except learning the specialized tool takes an order of magnitude more time than iterating on awk to get a good-enough solution.
I'm not proud to admit it, but I've used awk in a couple places where I should have properly used `expect' instead. Except, I don't know expect well and haven't gotten to the point where it was worth the investment in learning it.
Are you working in a Linux development environment? Targeting Linux systems? You're not going to come across much of a use for awk if you're in the Windows world. I'm making an assumption based on C#.
I've used perl and awk extensively for all sorts of things. Parsing embedded device logs to generate reports was a big one for me; that's where I really learned both languages. And they are superb tools for that task. We also used awk quite a bit for configuration parsing, as a sort of intermediary between different processes and scripts on those devices.
As I moved into web development, I've found myself using both much less. I haven't touched perl more than once in the last 6 years -- I was given a script to maintain recently but it only needed an hour or two of my time to add a small feature to. That perl script is part of a legacy build system. I still use awk every couple of months but it's just part of some pipeline on the command line.
Using any kind of management position as justification for anything tech-related won't have the effect you're probably aiming for. Dev going management is always downshifting.
So I just wrote one for personal use. I was looking for a duplicate file/photo finder and read some reviews about losing data, so I asked myself why I would trust a third party with my data and wrote a dupe finder. It uses a hash (md5) to make a match (not perfect, but it works), then sorts and prints to show me the dupes. The first version was 45 lines of code, although I could have reduced that if I tried. You don't strictly need awk, but I used it to prettify my output and add some filters. The current code, with lots more bells and whistles, is less than 150 lines (including whitespace) of bash and awk. And it works for me and I trust it.
I could set up a database, write multiple layers of code ... but really?
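For the curious, the core of such a dupe finder fits in one pipeline; this is only a sketch (the path is hypothetical, and hash collisions and odd filenames are ignored):

find ~/Photos -type f -exec md5sum {} + | sort |
  awk 'seen[$1]++ { print "duplicate:", $0 }'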
I don't, and that's half the problem :-) If I did, I'd probably remember enough perl that I'd have never bit the bullet and learned awk!
To answer the serious question, it's often enough for a first order solution to any problem where you have line and field separated text. Pulling instances of "something weird" out of log files, for instance, is a great use for awk, especially if the fingerprint of "something weird" is scattered through fields in a line in a way that makes grep cumbersome. Or if it spans a couple of lines, especially if you're dealing with a log from a multi-threaded app where there might be irrelevant lines interleaved with the ones you care about.
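For example (the field positions and log name are invented), something like this pulls out lines whose "weird" fingerprint is spread across several fields:

awk '$3 == "ERROR" && $7 ~ /timeout/ { print $1, $2, $NF }' app.log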
I haven't written much production-grade awk, but I've often used it as a tool to understand a problem well enough to write a production-grade solution or fix bugs in application code.
The author offers one answer to this in his opening text:
> I say it's useful on servers because log files, dump files, or whatever text format servers end up dumping to disk will tend to grow large, and you'll have many of them per server. If you ever get into the situation where you have to analyze gigabytes of files from 50 different servers without tools like Splunk or its equivalents, it would feel fairly bad to have to download all these files locally to then drive some forensics on them.
>
> This personally happens to me when some Erlang nodes tend to die and leave a crash dump of 700MB to 4GB behind, or on smaller individual servers (say a VPS) where I need to quickly go through logs, looking for a common pattern.
You ever deal with very large sets of large text files? You really CTRL-f through multi-GB files? Have you never needed to transform data between formats or yank out subsets of data for processing? You may have just been working on business systems that did not deal with significant data volumes, or with data you did not control (coming from some other organization, for example).
People (me anyway) used perl back in the dark ages of the 90s for the same sorts of things people use python for now. It was really the only option for open source general purpose interpreters for quite a while. People still use it in biotech I think, or were 5 years ago; the text processing capabilities are useful for genomic data.
Awk/sed and the ETL stack that comes with every Unix-like OS (aka od, tr, cut, sort) and all that are superb data wrangling tools. Log file parsing, data cleaning, even actual data science at scale can be done with these tools. They're extremely efficient, as they were designed in an era when pretty much all interesting data was comparable to or much larger than memory size. As such, you can do a lot of stuff with them that most people don't imagine is even possible. FWIW, for high-end data scientists I don't consider knowledge of these tools to be optional at all. Anyone who hasn't used them in their career hasn't worked on serious problems, and is probably the kind of educated idiot who will suggest you do a job in a giant Hadoop cluster that you could easily do on one machine.
Perl 6 is when they then took LSD and, at a hang-up, went to the clinic and came back as two people. One said: never again; the other: bring it on.
My absolute favorite example of what's possible with awk is this[0] calculator from Ward Cunningham about splitting expenses on a ski trip. It's a really beautiful little piece of code well-adapted to this problem.
This is the sort of thing we industry professionals need to study in more depth. It doesn't have the most flexible interface, but it's human-readable, and the implementation is simple, fast, and about as obviously correct as possible.
I glanced through the input and output. When I got to the code, I was startled by how short it was. This page is worthy of its own post on Hacker News. I am now convinced that Ward Cunningham is a genius. I was already pretty sure after reading how he invented the wiki.
This reminds me of a story of a guy who wrote a whole company internal debit system like this (coffee, meals out, etc) and it turned into a currency. I feel like I either saw it here, or a similar forum. Anyone have a link? My searches have not found it...
It's also worth mentioning that local variables can be simulated using additional formal parameters. In AWK, any missing parameter in a function call starts out uninitialized (the empty string, which also compares equal to zero).
Let's say we have a function CharCount which takes a character and a line of text and returns the number of occurrences of that character:
function CharCount(ch, line,
n)
{
...
}
The line break in the parameter list is an AWK convention and indicates that n is a "local variable."
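To make the convention concrete, here is a rough sketch of a complete version (the body is purely illustrative, not from the parent):

function CharCount(ch, line,
                   n)
{
    while (index(line, ch) > 0) {
        n++
        line = substr(line, index(line, ch) + 1)
    }
    return n
}
BEGIN { print CharCount("a", "banana") }    # prints 3; callers never pass n, so it acts like a local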
I love using awk. It was fairly easy to pick up, and it slides into my command line workflow pretty well.
That said, there's an amazing amount you can do with it, if you really try. Someone once joked that I should try building an IRC bot in AWK, so I did:
https://github.com/Marcus316/rufus
There's no practical reason for it, but it was fun to play with the idea.
Very nice summary of the language. However, keep in mind that awk is not the fastest language around. A pro-awk thread might not be the best place to tell this story, but it is fresh and true:
Last weekend I was playing around with some data. At first, I thought 'let's just write a line of awk and be done with it' and so I did. The execution took 20 seconds (about 17 million lines) and everything was fine.
Later that day, I came across another task which seemed too complex for an awk one-liner so I took two lines of R and was surprised when R was done within 5 seconds on the same data set.
I was happy because I found a faster tool than the one I had, but the lesson is that just because you use a proven tool like awk doesn't mean there aren't any better tools. Find out what works best for you.
awk is a language with multiple implementations, so we can only talk about the performance of a particular awk implementation. This is especially true for awk because mawk (based on a VM rather than being a traditional interpreter) is so much faster than other awks, if you can live with its limitations [1]. If you're using gawk, setting LANG=C also helps performance because no utf-8 handling needs to be done. Speaking of which, I've noticed gawk on recent Ubuntus (19.10) appears broken and/or has regexp size limits breaking awk code with large regexps (but I haven't checked thoroughly yet).
Was it a faster tool? You had a need, you tossed awk at it and accomplished your goal. Did the 20 seconds vs. 5 seconds matter? The time you spent deciding what to do and how to do it took more time. For a one-off, grab some data type of task, the tool that you know well will almost always be the fastest because you get to the end result the quickest.
I agree with the comment below. If you haven't already, try mawk instead of awk. It is often many times faster than awk (and other solutions).
What kind of task was it? I'd imagine for loading a csv-like file and doing somewhat heavy calculations, R would win, but when I think of AWK problems I think of text manipulation, and when I think of manipulating text I do not think of R.
Disclaimer: I don’t actually write any AWK, but learning it is on my bucket list.
AFAIK, calculating the difference to its previous line for every line should be faster for data that is not sorted. I didn't try to write that one in R though.
I am sure there are ways to make both things faster; I was just surprised, as I didn't expect R to be faster with those two naive implementations.
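In awk, that per-line difference to the previous line is roughly the following (assuming the value of interest sits in the first column of a hypothetical data.txt):

awk 'NR > 1 { print $1 - prev } { prev = $1 }' data.txt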
I have never found anything faster than mawk. GNU awk (gawk) is orders of magnitude slower. For a proper comparison you would have to note which implementation of awk you are using.
Whenever somebody says "Oh, I don't know how to use awk", this is the link I send them. It's easily the most useful tutorial for awk on the 'net. There's a lot more you can do with awk, but this site shows you the top percentage in a very short amount of time.
It's crazy how this reappeared on HN only hours after I went searching for awk on Hacker News. Fred Hebert is so well known in the Erlang and Elixir communities.
I have just coined this idea the "gateway drug" approach to learning new tools. We should strive to find those introductions that are small enough that someone can digest without a massive upfront time investment to get them past the front door :)
I've heard many good things about that book, which is why I bought it several months ago. From the bits I have read, I like it. But I have not yet finished it, partly because it is not pamphlet sized. It is over 300 pages. But I look forward to the chapter on Filters.
Very cool, and I'm grateful to see Awk presented in such a friendly way.
However, at some point, it makes more sense to write your Awk script in something like Python, and my intuition says that that time is shortly after starting a basic Awk script. Using a real programming language is almost always the way to go with code that will become more complex over time (almost all code) and code you have to share with a team (learning curve).
Items can be put on a single line without ambiguity using semicolons:
pattern ; { action } ; pattern { action }
A pattern can have multiple patterns separated by a comma. That syntax admits optional line separation after the comma separators:
pattern,
pattern,
pattern {
action
}
The action fires by a match for any of the patterns.
The POSIX standard Awk expression grammar has no comma operator, on the other hand; the comma exists only for separating patterns, function arguments/parameters, and items in the print statement syntax.
Plus, JNIL (just now I learned). The in operator allows comma-separated expressions for testing "multi-dimensional" array membership. Here is a hello, world:
$ awk 'BEGIN { a[1,2,3] = 4 ; print (1,2,3) in a }'
1
(Note that multi-dimensional arrays in Awk are simulated; a string index is generated for that 1,2,3).
One trick I use with awk, especially throwaway ones, is to use grep to subset the data before feeding it into awk. Often I'm using awk to poke at some data to diagnose a problem, not writing a script to be run often or stuck in cron.
So to use an example from this nice short article: I might do grep GET log-file | awk blah-blah. Then awk doesn’t need to consider the lines I don’t care about. This is especially useful when iteratively writing the awk script.
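Concretely that might look like the following (the field numbers assume something access-log-shaped, so treat it as a sketch): show the paths of GET requests that came back as server errors.

grep GET log-file | awk '$9 >= 500 { print $7 }'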
Oh yeah, sure! I should have been clear that when dredging a really large log file or such it can be faster to run awk on a much smaller subset when your awk program has a lot of productions. I find it easier to think "OK, just run awk on these lines that matter."
There are two use-cases I’ve run into where awk really shines, and is hard to replace:
1. Writing scripts for environments that only have Busybox. Technically you can write scripts in ash, but I don’t recommend it for anything beyond a couple lines. It’s missing a lot of the features from Bash that make scripting easier, and it’s easy to get mixed up if you’re used to Bash and write things that don’t work. Awk is the best scripting language available, even if you’re doing things that don’t exactly match what it was designed to do.
2. Snippets that are meant to be copy+pasted from documentation or how-to articles. In that case, it’s often not easy to distribute a separate script file, so a CLI “one-liner” is preferred. You also can’t count on Perl, Python, etc. being available on the user’s system, but awk is pretty universal.
For most other cases, I tend to create a new .py file and write a quick Python script. Even if it’s a little more overhead, it helps keep my Python skills sharp, and often it turns out that what I actually want is a little more complicated than my initial idea anyway.
Great post. I didn't know Fred had a blog. Fred's writing style never ceases to amaze me. His book on property based testing in Elixir was a better introduction for me to Python's Hypothesis than any of the other tutorials I found online on the topic because it made me think about discovering properties of my code. Would recommend it to others.
If the file paths have spaces in them, then you have to wrap the name in double quotes. I have found it challenging to output those in mawk -- at least not easily, anyway. In this case, I have a Windows version of tr.exe.
I use awk when I need something a bit more powerful than "grep"/"cut", but don't want to pull out the big guns with Perl.
For example, recently I needed to print out certain fields of an output, but only for a given subset. So I used Awk to create a simple state machine (enable when I see the start of the subset, disable at the end), and print the fields of interest.
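A minimal sketch of that shape (the markers, fields, and file name here are made up):

awk '/BEGIN_SECTION/ { on = 1; next }
     /END_SECTION/   { on = 0 }
     on              { print $1, $3 }' report.txt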
The grep/cut replacement is exactly what I use awk the most for. I use it for other things too but this stands out. awk just fixes the problem when you need to have more than one field from the input and maybe a different string between them. I guess this can be done with cut too, but for me it's simpler this way.
Also I use awk often in scripts together with fzf to interactively switch contexts e.g. in cloud provider CLIs.
Awk '{ print $2 }' does LWSP gobbling; cut -d' ' -f2 doesn't, and this difference alone makes awk useful to me on a daily basis.
Awk has hashes (associative arrays) like perl, which are very efficient. I use an awk expression to print unique entries as they come in, counting through the hash and printing on first insert, instead of uniq, which only prints at the end.
Counting unique entries over 300,000,000 IPs in awk was as fast as perl and python, with a smaller memory footprint.
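The classic idiom for that (assuming the IP is the first field of a hypothetical access.log) is the first line below, which prints each line the first time its IP appears; the second variant just reports the unique count at the end:

awk '!seen[$1]++' access.log
awk '!seen[$1]++ { n++ } END { print n }' access.log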
LWSP=Linear White Space
"Linear white space is: any number of spaces or horizontal tabs, and also newline (CRLF) if it is followed by at least one space or horizontal tab." [1]
I don't know what LWSP gobbling is, though. Google didn't help.
Yes, whereas cut considers '  ' (two spaces) as delimiting a blank column when splitting on the -d' ' space. It's like ,, in CSV being a blank column.
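A quick way to see the difference:

$ printf 'a  b\n' | awk '{ print $2 }'
b
$ printf 'a  b\n' | cut -d' ' -f2

(The blank output from cut is the empty field between the two spaces, while awk gobbles the whole run of spaces and gives you b.)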
This made me very happy, since it goes to the 'you can either use awk or explain awk but not both' problem, which is definitional in monads in Haskell. I thought LWSP was a thing. I forgot not everything is a thing. And 'gobbling' is maybe very jargon-ish. Python sys.stdin.rstrip().split() comes to mind.
If argument expansion is required, consider using read. That way the argument is given a name. Use "read a b c < x" instead of "b=$(cat x | awk '{ print $2 }')".
Just remember to pipe to read with care, as the right hand part of a pipe is a subshell in which variables are local. So "read a b c < <(echo 1 2 3)" works, "echo 1 2 3 | read a b c" doesn't.
Another way to expand arguments is to simply define a shell function and use $2. Something like "process_line() { echo $2; }".
When you have to reach for something like awk, your script would probably improve by being mostly awk. In which case most people are probably looking at perl or python anyway. Even if awk is a nice language, the arrival of perl mostly killed it. It is not wrong to say perl was the next version of awk.
If you just use awk for `{ print $2 }` then I would still prefer `tr -s ' ' | cut -d ' ' -f2`, since both are part of coreutils and with `awk` you would add an additional dependency to your script.
IMO, it would be great if 'cut' could accept more than just a single character for delimiters.
e.g. a regex or even a shell-style pattern match would be very useful. `cut -d ' +' -f2` would get rid of the need for 'tr' in your example.
Even just allowing >1 char delimiters would be helpful. A text list like 'foo, bar, foobar' is too much for cut, but it would be fine if cut accepted parameters like `cut -d ', '`
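For what it's worth, awk's -F already takes an extended regex, so both the multi-character and the 'one or more spaces' cases are easy there; a quick sketch:

$ echo 'foo, bar, foobar' | awk -F', ' '{ print $2 }'
bar
$ echo 'foo  bar baz' | awk -F' +' '{ print $2 }'
bar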
Why would being part of coreutils matter? Awk is part of POSIX, just like tr and cut. Even in very constrained environments you can count on it via busybox.
A perpetual argument is use of too many pipe separated distinct commands. The argument suggests it's lazy to use sed | awk | grep type pipe runs because in all probability sed or awk alone could have done it and you incurred two excess fork/exec() and therefore consumed kernel and userspace beyond your need. The usual rejoinder is "get nicked"
A tl;dr for the following tl;dr: it's great for quick-and-dirty processing of logs.
The tl;dr is that AWK is a simple language that lets you use a line-oriented, event-driven programming model to process text. This abstraction is simple enough to be readable and maintainable, but powerful enough to parse and process text with line-oriented structure.
The example linked elsewhere is worth studying, because it truly is one of the prettiest pieces of programming I've seen in a long while:
If you don't know anything about AWK, here's all you need to know to grok this:
1. An AWK program is a list of [event] { code } pairs. All pairs are run in sequence on each line of input; if the "event" evaluates to true or is a matching regex, the matching code runs.
2. { code } on its own runs unconditionally.
3. The input is automatically whitespace delimited into fields; $1 refers to field 1, $2 to field 2, etc
4. NF is a special value that yields the number of fields
5. Assigning to fields replaces the text in those fields.
6. ($1+0) != 0 implicitly converts $1 to a number; if the field isn't numeric, the value is 0.
There is a lot of implicit loose typing in the language, and usage of undefined variables is idiomatic. Functions aren't easy to use, either. So it's not well-suited for programming in the large, or even the medium. But for the scale of programming seen in that link, it's truly a wonderful and simple power tool.
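Tying points 1, 2, and 6 together, a tiny example in the same spirit (expenses.txt is hypothetical): sum every line whose first field is a number and ignore everything else.

awk '($1+0) != 0 { total += $1 } END { print total }' expenses.txt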
I use awk to write csv-parsers for my personal finance setup. They take transaction logs and convert them into a ledger format (https://www.ledger-cli.org/). There's some pretty trivial if/else branches based on regexp's, which awk is pretty good at expressing.
I'd normally write this sort of thing in python. The awk program is usually smaller than a comparable python program, but for me the main selling point is that awk programs are more "UNIX-pipe-native" than python programs are.
At some point, I'll probably rewrite these programs in a more "serious" programming language, once the file sizes get too big or something, but for now they're working great and are easy to extend.
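As an illustration only (the column layout of date,amount,description, the categories, and the file name are all made up, and it naively assumes no quoted commas), such a parser can be as small as:

awk -F, '
  { cat = "Expenses:Uncategorized" }                  # default category for each transaction
  /GROCERY|SUPERMARKET/ { cat = "Expenses:Food" }
  /COFFEE/              { cat = "Expenses:Coffee" }
  { printf "%s * %s\n    %s    %s\n    Assets:Checking\n\n", $1, $3, cat, $2 }
' transactions.csv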
I've used it for this too, but things tend to go south pretty quick if you have escaped or quoted commas. That, unfortunately, is where I usually break down and pull out python.
The risk of that seems to be if CSV allows embedded escaped quotes in a quoted string. Does it? I don't know. And CSV is pretty loosely defined. For most people it's probably "whatever Excel emits or ingests".
And I think that's why we're on the same page about pulling out python and using a module where somebody has explored what the corner cases are and dealt with them for us already.
Yea CSV is famously bespoke, I don’t know if it’s ever been standardized?
In my experience, each implementation is more or less unique. Even if you allow escaping, different formats permit different methods of escaping and you might have to account for each! shudders
It’s much easier to support that in python than it is in awk with regexp. The awk route will eventually make you a regexp wizard, though, which confers it’s own benefits. :)
Awk automatically operates on every line of a text file, so you can easily use it as part of a pipeline in Linux. I'll usually use grep, cut, sort, and awk all together with one line of code and no need to open an editor. Python takes a lot more code to do these basic things. However, Python is much better once the complexity grows past a certain point.
depends upon the use case; for relatively simple column filtering, substitution, etc., an awk script would be briefer than the python equivalent and most likely faster as well
if it becomes lengthy (and again depends on features required), then Python is likely to have inbuilt/3rd-party libraries to make it easier to write and maintain
if it is a question of constructing command line one-liners and using it as part of other cli tools, then awk wins easily
for example:
awk -F'\\W+' -v OFS=, '{print $NF, $2}' input.txt
prints last column and second column, where non-word characters form the field separator and comma is used as output field separator
Processing text files line by line is trivial in awk. Maybe it's easy in Python too but at least in awk you don't have to worry about the 2 vs 3 silliness.
Plus I think semantic indentation is not sane language design.
Plus you can pipe it to another command with ease.
Nice intro. Awk is a useful tool when you want to do simple line-by-line processing.
Note: The article says that awk patterns can't capture groups. The standard doesn't provide that functionality, but if you use the widely-available gawk implementation, gawk does have that capability (use "match").
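For instance, gawk's three-argument match() fills an array with the capture groups; a tiny made-up example:

$ echo 'user=alice id=42' | gawk '{ match($0, /id=([0-9]+)/, m); print m[1] }'
42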
Last year, I attended a tech event where Rasmus Lerdorf (the creator of PHP) was a guest speaker. When asked about his favorite programming language, he said: AWK. He didn't elaborate why but looking at this article, I think there is something to it.
I recommend this article to people all the time and often come back to it myself. I don't use awk as much these days but when I worked in bioinformatics I was fluent and loved it.
Every time I set out to use awk for some one time task I struggle to express what I need and end up saying fuck it and fire up vim and do exactly what I want with some quick macros.
I usually persevere and get it done in awk but then find the next time I need awk that I don't remember any of it and can barely understand my own examples.
Nice, I finally took the time to read the man pages for awk, and whipped up a script to count the number of errors that occurred per day in a postgres log file.
cat logfile | awk '/ERROR:/ {counts[$1] = counts[$1] + 1}; END { for (day in counts) print day " : " counts[day]}' | sort
I just needed to know how awk programs are structured, the rest is just simple programming!
EDIT: I'm not sure if it's actually correct however...
> cat logfile | awk '/ERROR:/ {counts[$1] = counts[$1] + 1}; END { for (day in counts) print day " : " counts[day]}' | sort
Great first program! A slightly less verbose version could be
> awk '/ERROR:/ {counts[$1]++}END{...}' logfile
there are also ways of sorting the output within (g)awk (asort & asorti), but sorting externally as you have done is more flexible and engages another core, which can be faster on large input
There's really nothing useless about that use of cat: it makes the pipeline compose better from left to right. It's not like you have to pay 25 cents for each process you spawn.
It's not detrimental to performance since an empty cat is a no-op in a pipeline. You can have any number of them. But commands should be written for humans to understand, and inserting no-ops is a distraction to the reader.
In the trivial example, "grep needle haystack" reads better than "cat haystack | grep needle".
This is a great intro; I have the GNU Awk user's manual bookmarked because there are a lot of features in gawk you will only rarely use but are quite useful.
I've recently modified and published that tutorial as an ebook [1], which can be read from the github repo or downloaded as PDF (currently free). I am also updating the book to include exercises, other minor improvements as well as epub version - expected release by next weekend.
# common lines
comm -12 <(sort file1) <(sort file2)
# lines unique to first file
comm -23 <(sort file1) <(sort file2)
# lines unique to second file
comm -13 <(sort file1) <(sort file2)
regarding readability, it is the same with any new tool or programming language: you'd need to be familiar with its syntax and idioms; someone not familiar with the command line and the sort/uniq commands will find your solution just as alien