I took the "this sounds like Crowdstrike" tack for two reasons. First, the write-up characterized this update as an every-five-minutes process. Second, the update, being a file of rules, felt analogous in format to the Crowdstrike signature database.
I appreciate the OSPF analogy. I recognize there are portions of these large systems that operate more like a routing protocol (with updates being unpredictable in velocity or time of occurrence). The write-up didn't make this seem like one of those. This seemed a lot more like a traditional daemon process receiving regular configuration updates and crashing on a bad configuration file.
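To make the "daemon crashing on a bad configuration file" picture concrete, here is a minimal sketch of the alternative: a process that rejects a bad update and keeps serving its last known-good rules. Everything here is hypothetical (the JSON format, `MAX_RULES`, the `Daemon` class); it says nothing about how Cloudflare's system is actually built.

```python
import json

MAX_RULES = 200  # hypothetical preallocated limit, for illustration only


class ConfigError(ValueError):
    """Raised when a config parses but violates a limit or shape check."""


def parse_rules(text):
    """Parse a rules file (assumed to be a JSON list) and enforce limits."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ConfigError("expected a list of rules")
    if len(data) > MAX_RULES:
        raise ConfigError(f"{len(data)} rules exceeds limit of {MAX_RULES}")
    return data


class Daemon:
    def __init__(self, initial_rules):
        self.rules = initial_rules  # last known-good rules
        self.rejected = 0           # count of rejected updates

    def reload(self, text):
        """Swap in new rules only if they validate; otherwise keep the old ones."""
        try:
            new_rules = parse_rules(text)
        except ValueError:
            # Covers both json.JSONDecodeError and ConfigError.
            self.rejected += 1
            return False
        self.rules = new_rules
        return True
```

Whether failing open like this is the right call depends on the system: for some rule sets, running on stale data is worse than refusing to serve.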
It is possible that any number of things people on this thread have called out are, in fact, the right move for the system Cloudflare built (it's hard to know without knowing more about the system, and my intuition for their system is also faulty because I irrationally hate periodic batch systems like these).
Most of what I'm saying is:
(1) Looking at individual point failures and saying "if you'd just fixed that you wouldn't have had an incident" is counterproductive; like Mr. Oogie-Boogie, every big distributed system is made of bugs. In fact, that's true of literally every complex system, which is part of the subtext behind Cook[1].
(2) I think people are much too quick to key in on the word "config" and just assume that it's morally indifferentiable from source code, which is rarely true in large systems like this (might it have been here? I don't know.) So my eyes twitch like Louise Belcher's when people say "config? you should have had a staged rollout process!" Depends on what you're calling "config"!
I just want to point out a few things you may have overlooked. First, the bot config gets updated every 5 minutes, not every few seconds. Second, they already have config checks in other places ("Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input").
They could probably even align everything in CI/CD if they ran the config verifier where the configs are generated. This is of course all blind guessing in hindsight, but you make it sound a bit arcane, as if nothing could be done.
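As a rough sketch of what "run the verifier where the configs are generated" could look like: the generator calls the same checks the consumer would apply and refuses to publish on failure. The names (`verify_features`, `publish`), the 200-entry limit, and the duplicate check are all assumptions for illustration, not anything taken from the write-up.

```python
def verify_features(features, max_features=200):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if len(features) > max_features:
        errors.append(f"too many features: {len(features)} > {max_features}")
    seen = set()
    for name in features:
        if name in seen:
            errors.append(f"duplicate feature: {name}")
        seen.add(name)
    return errors


def publish(features, sink):
    """Append the config to `sink` (a stand-in for distribution) only if it verifies."""
    errors = verify_features(features)
    if errors:
        return errors  # nothing is published; the previous config stays live
    sink.append(list(features))
    return []
```

With a five-minute cadence there is room for a check like this before distribution; for something that propagates in seconds the trade-off would look different.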