Coming from a software engineering perspective, there is a certain amount of toil which is impossible to automate away. CI break-fix issues often depend on the surface area of your software as it interfaces with third parties, including the CI system itself. In some cases that surface area can be large and break-fix takes up a considerable amount of time, but that toil is not _repetitive_, and it is _necessary_: table stakes for running the system.

And this is after having someone working on the system who is extremely aggressive with automation and empowered to do whatever they like to reduce that surface area. I've taken codebases and hacked out 60% of the lines of code in order to remove brittle external surface area along with unnecessary requirements, contain the project better within its own boundaries, and stop repetitive issues. I've taken clever ideas that someone had 5+ years ago out behind the barn and shot them in order to reduce total surface area.

But people can walk into an area with a lot of toil going on and go "oh, I know all the strategies on how to reduce this, I will explain to these people who clearly aren't as clever as me how to do it" without realizing that there's often a minimum level of toil for a project which you can't effectively reduce. There's a nonzero vacuum expectation value of toil in any project, and in some cases it can be quite large. Inherently.

I don't know how many managers I went through who would come in and decide to document all the different failures we were having, put them in a spreadsheet, and look for patterns to address. And every week there would be 2-3 new failures that would come up, and they'd struggle with the fact that there was really no pattern, other than that the project inherently touched many different third parties, because it really HAD to, and that those third parties would change, which would then force interrupt-driven toil.
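To make the "no pattern" point concrete, here's a rough sketch of what that spreadsheet exercise tends to look like. The incident log below is entirely made up for illustration (the causes, the weeks, and the helper names `top_cause_share` and `distinct_causes` are all hypothetical), but the shape is the point: lots of distinct causes, and no single one dominating enough to be worth automating away.

```python
from collections import Counter

# Hypothetical incident log: (week, third party that changed underneath us).
# This is made-up illustrative data, not a record of real failures.
incidents = [
    (1, "package mirror"),
    (1, "CI runner image"),
    (2, "cloud API"),
    (2, "TLS cert chain"),
    (2, "package mirror"),
    (3, "upstream compiler"),
    (3, "container registry"),
    (4, "DNS provider"),
    (4, "cloud API"),
]


def top_cause_share(log):
    """Fraction of incidents attributable to the single most common cause."""
    counts = Counter(cause for _, cause in log)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(log)


def distinct_causes(log):
    """Number of different causes seen in the log."""
    return len({cause for _, cause in log})


if __name__ == "__main__":
    print(f"{distinct_causes(incidents)} distinct causes "
          f"across {len(incidents)} incidents")
    print(f"top cause accounts for {top_cause_share(incidents):.0%}")
```

If the top cause were most of the log, you'd have something to go fix. When it's a long flat tail like this, the only common factor is "a third party changed", which isn't something a spreadsheet can address.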

There's some point where you just have to hire more people and spread it out. There's no magical incantation to manage your way out of additional headcount.

And I don't think the OP article even touched on re-engineering to reduce surface area and brittleness. Automation isn't the only answer to toil. You can automate restarting a service if it crashes, but it's always better to just fix the bug (which may involve fixing architectural issues) and make it stop crashing in the first place.