Over a decade ago, the maintainer of SQLite gave a talk at OSCON about their testing practices. One concept that stood out to me was the power of checklists, the same tool pilots rely on before every flight.
He also mentioned Doctors Without Borders, who weren't seeing the outcomes they expected when it came to saving lives. One surprising reason? The medical teams often didn't speak the same language or even know each other's names.
The solution was simple: a pre-surgery checklist. Before any procedure, team members would state their name and role. This small ritual dramatically improved their success rates, not through better technique, but through better communication.
I've always found an enormous number of good practices (not just engineering ones) in aircraft operations and engineering that would be applicable to software engineering.
I've always daydreamed of an IT organization that combined those with the decision-making procedures and leadership of modern armies, such as the US Army.
I've re-read FM22-100 multiple times, and I find it strikingly modern and inspiring:
While I do understand that business leadership cannot be compared to the standards required when the stakes are far higher, I think there are many lessons to learn there too.
It's all about balance, in the end. If you do too much of one thing your business will fail. If you don't do enough of another, your business will fail too. And they're the same thing...
The trick then is to do just enough of everything to avoid disaster and to move as fast as you can to get to a realm where you can actually afford to do it right. Most start-ups initially cut corners like it is crunch time at the circle factory, which then usually catches up with them at some point, either killing them or forcing them to adopt a different pace. Knowing exactly when to put more or less attention on some factor is the recipe for success, but nobody has managed to execute that recipe twice in a row without finding things that no longer work, so it remains a dynamic affair rather than one you can ritualize.
And that's where checklists shine: repeated processes that are well defined and where change is slow enough that the checklists become 'mostly static'. They still change, but the bulk of the knowledge condensed in them stays valid over multiple applications.
In some areas, I absolutely agree... I think when it comes to vehicles, medical devices and heavy equipment, it would be better to see much more rigorous practices in terms of software craftsmanship. I think it should be very similar in financial operations (it isn't) and most govt work in general (it isn't).
In the end, for most scenarios, break fast, fix fast is likely a more practical and cost effective approach.
On the other hand, I think this narrative also causes a lot of useless red tape. There might be some survivorship bias here.
Aviation, Doctors Without Borders, and SQLite have good checklists. Checklists are simple, so it's easy to think "oh I could do that too". But you never hear about the probably endless companies and organizations that employ worthless checklists that do nothing but waste people's time.
I wish there was more talk about what makes a checklist good or bad. I suspect it's kind of like mathematics where the good formulas look very simple but are very hard to discover without prior knowledge.
I make checklists for myself and they're enormously helpful. Because my brain can't always remember every single little detail of every complex task every single time.
I've also seen checklists made by morons that are enormously unhelpful.
IMO it's paramount for whoever is making the checklist to have familiarity with the task at hand (both how to do it properly, and what steps people tend to miss or get wrong), investment (is this tool something you'd find indispensable for yourself if you were placed in the role of executing it?), a sense of pragmatism and conciseness.
The ability to recognize what things will be obvious or flow naturally from A to B helps eliminate redundant fluff. e.g. I train volunteer firefighters and in most canonical steps for calling a Mayday, one is basically "Tell the person on the other end what's wrong". You don't need a checklist item for that. When something goes seriously sideways and you need help, you will be very inclined to convey what is the matter.
> But you never hear about the probably endless companies and organizations that employ worthless checklists that do nothing but waste people's time.
Most if not all of the bad checklists I have encountered were bad for the same reasons: they were untested or poorly written, and most of the time both.
Untested in the sense that the checklist was written by somebody who doesn't actually know how to do the whole project. Compare professionals like doctors and pilots: they are well trained, and the checklist is well understood to be a reminder. The rationale behind each item was taught, and even where it wasn't, professionals will question anything they don't understand, while most others in their field could immediately give a detailed answer.
Another example would be HR writing an onboarding checklist. 99% of the time, the ones I have seen are intended to make HR's life easier, not the candidate's or applicant's.
A checklist is also a clear and distilled form of writing. As the saying goes, I don't have time to write you a short letter, but I have time for a long one. Writing short points with clarity takes a long time, and it is not a skill everyone possesses. Nor do people have the time to do it when it is not part of their job or KPI.
WHO and Gawande emphasize iteration, that the draft is always wrong. They also claim good checklists are really coordination tools disguised as task lists.
While I certainly found it insightful, I felt like this book (like so many in the genre) was a pamphlet's worth of material inflated to fill about 250 pages.
It's true that you can boil it down a lot. In fact, the book even has a checklist checklist that distills down the advice to one page. However it was overall a very quick read and the extra discussion really did further my understanding of the underlying principles that make a checklist good. I'd recommend reading the whole thing so that you actually make a useful checklist instead of a cargo-cult copy of an aviation checklist.
The thing that drives me absolutely mental about most developers I’ve worked with is just how much work they’ll do to avoid the easy thing, if the easy thing isn’t programmatic.
I have tests and CI and all that, sure. But I also have a deployment checklist in a markdown document that I walk through. I don’t preserve results or create a paper trail. I just walk the steps one by one. It’s just so little work that I really don’t get why I cannot convince anyone else to try.
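For anyone who wants slightly more than scrolling through the file, here is a minimal sketch of "walk the steps one by one" in Python. It assumes the checklist uses GitHub-style task items (`- [ ] ...`); the `EXAMPLE` content and the function names are made up for illustration, not taken from anyone's real deployment doc. It keeps no results and leaves no paper trail, same as the manual version.

```python
import re
from typing import Callable, List

# Matches GitHub-style task items such as "- [ ] Run migrations" or "* [x] Tag release".
TASK_RE = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*\S)\s*$")


def parse_steps(markdown: str) -> List[str]:
    """Return the text of every task-list item, in document order."""
    steps = []
    for line in markdown.splitlines():
        match = TASK_RE.match(line)
        if match:
            steps.append(match.group(2))
    return steps


def walk(markdown: str, ask: Callable[[str], str] = input) -> bool:
    """Prompt for each step; stop at the first one that is not confirmed with 'y'."""
    for number, step in enumerate(parse_steps(markdown), start=1):
        answer = ask(f"{number}. {step} [y/N] ")
        if answer.strip().lower() != "y":
            print(f"Stopped at step {number}: {step}")
            return False
    return True


# Hypothetical checklist content, for illustration only.
EXAMPLE = """\
# Deploy
- [ ] CI is green on main
- [ ] Database backup finished
- [ ] Announce the deploy in chat
"""
```

Calling `walk(EXAMPLE)` prompts for each item in turn; passing a different `ask` function makes it easy to try out without a terminal.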
Manual checklists are often the best option for repeated tasks that can't be automated sufficiently reliably and sufficiently economically. But if they can be, then manual checklists are unnecessarily inefficient and/or unreliable. And the more frequently repeated the task is (ceteris paribus), the more up-front energy is justified in automating it. That said, to automate a process, you have to understand it enough to generate a checklist as a prerequisite (and, sure, you can develop that understanding in the course of automation, but doing so first will also go a long way to informing you if automation is likely to be worthwhile.)
That said, and without prejudice to SQLite’s use of checklists which I haven’t deeply considered, while the conditions that make checklists the best choice are definitely present in aviation and surgery in obvious ways, processes around software tend to lend themselves to efficient and reliable automation, with non-transitory reliance on checklists very often a process smell that, while not necessarily wrong, merits skepticism and inquiry.
Shoutout to Dr. Atul Gawande's excellent book The Checklist Manifesto, an expansion of his New Yorker article [0]. One of his main points is that even the most competent people forget stupid stuff. He illustrates with examples from surgery, from aviation, from the construction industry, and others. He quotes a saying that aviation checklists are "written in blood."
That is really insightful regarding the ritual improving outcomes through better communication - something I see reflected in many meetings I turn up to now which involve an introduction round between participants, and anecdotally improves participation in the meeting.
It would be amazing if someone had a link to a page with the MSF story, as that is a great reference to have! My google-fu hasn’t helped me in this case.
It's just a shell script wrapped around $EDITOR and git. The intent is to write checklists in the style of github-flavoured markdown. It has some tricks such as envsubst(1) interpolation and embedding little scripts whose execution is captured alongside the execution itself.
Here's an example checklist that's fairly well-worn (though is somewhat dated):
Where checklist entries are commands I just copy/paste them into a shell session. Usually this is in a tmux split with the checklist on one side with a shell on the other.
Could more of it be automated? Perhaps, though when some of the steps fail for various reasons, I find it's easier to recover or repair the situation if I have been invoking them myself sequentially. The embedded script support enables progressive automation where stability of the results is demonstrated over time.
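To make the "progressive automation" idea concrete, here is a rough Python sketch. It is not the shell tool described above, just an analogue under assumptions: `Step`, `render`, `run_step`, and `CHECKLIST` are hypothetical names, and `string.Template` stands in for envsubst(1) (it leaves unknown placeholders untouched, where envsubst would blank them). Each step starts out manual and only gets a command attached once it has proven stable.

```python
import subprocess
import sys
from dataclasses import dataclass
from string import Template
from typing import Dict, List, Optional


@dataclass
class Step:
    text: str                            # what the operator reads
    command: Optional[List[str]] = None  # attached only once the step is stable enough to script


def render(step: Step, variables: Dict[str, str]) -> Step:
    """Fill $NAME / ${NAME} placeholders in the text and in each command argument."""
    text = Template(step.text).safe_substitute(variables)
    command = None
    if step.command is not None:
        command = [Template(arg).safe_substitute(variables) for arg in step.command]
    return Step(text, command)


def run_step(step: Step) -> str:
    """Run a scripted step and return its captured stdout; manual steps return ''."""
    if step.command is None:
        return ""
    result = subprocess.run(step.command, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical checklist: the first step is still manual, the second has been scripted.
CHECKLIST = [
    Step("Confirm $VERSION is announced in the changelog"),
    Step("Print the version being released", [sys.executable, "-c", "print('$VERSION')"]),
]
```

Because `check=True` raises on a non-zero exit, a failing scripted step stops right there, which keeps the "I can see exactly where it went wrong" property of running things by hand.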
The stupid answer is that not everything that can be automated should be.
The real answer is of a more philosophical nature: if you manually had to check A, B, C... Z, then you will have a better understanding of the state of the system you work with. If something goes wrong, at least the bits you checked can be ruled out, freeing you to check other factors. What if your systems correctly report a fault, yet your automatic checklist doesn't catch it?
Also, this manual checklist checks the operator.
You should be automating everything you can, but much care should be put into figuring out if you can actually automate a particular thing.
Automate away the process to deploy a new version of hn, what's the worst that can happen?
But don't automate the pre flight checklist, if something goes wrong while the plane is in the air, people are going to die.
I think a less verbose version of the above is that a human can detect a fault in a sensor, while a sensor can't detect it is faulty itself.
I'm not a pilot, but my brother is, and I've watched him go through these a bunch of times before takeoff and landing. I think it's about more than automation: these days the aircraft computer "walks" the pilots through the checklists, but it's still their responsibility to verify each item. I think it's an interesting approach to automation, keeping humans in the loop and actually promoting responsibility and accountability, as in "who checked off on these?"
Someone checks that they ran successfully, and vouches for it.
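One way to picture that split between "the machine runs the checks" and "a person vouches for them" is the small Python sketch below. Everything in it (`SignOff`, `run_checks`, `sign_off`) is a hypothetical illustration, not how any avionics or release system actually works: the checks run automatically, but no record exists until a named operator signs off, and sign-off is refused if anything failed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict


@dataclass(frozen=True)
class SignOff:
    operator: str             # who vouched for the results
    results: Dict[str, bool]  # check name -> passed?
    signed_at: datetime       # when they vouched (UTC)


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every automated check and record whether it passed."""
    return {name: bool(check()) for name, check in checks.items()}


def sign_off(operator: str, results: Dict[str, bool]) -> SignOff:
    """Record who vouched for the results; refuse if anything failed or nobody is named."""
    failed = [name for name, passed in results.items() if not passed]
    if failed:
        raise ValueError(f"cannot sign off, failed checks: {', '.join(failed)}")
    if not operator.strip():
        raise ValueError("a named operator must vouch for the results")
    return SignOff(operator, dict(results), datetime.now(timezone.utc))
```

The point of the design is that the answer to "who checked off on these?" is always a person's name, never the name of a pipeline.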
Automating the automation can be counterproductive.
Like when the release process is triggered automatically by a tag, then fails after an hour-long sequence of complex steps, which forces you to re-tag, but by then your tag is out there.
Or, simply, it's a bad idea to run the entire process from scratch, but you automated it such that it's easiest, so you fix something about it and the only way to test the release process itself is to release, and you now need half a dozen releases to get it right.
Checklists that I use in personal life:
- Office packing list. A “do-check” checklist that takes 20s to run through right before leaving home
- Checklists for multi-day business and leisure trips
- Home maintenance checklist for filters, drains and other things that require regular maintenance.
https://sqlite.org/src/ext/checklist/3070700/index