> Today, many friends pinged me saying Cloudflare was down. As a core developer of the first generation of Cloudflare FL, I'd like to share some thoughts.
> This wasn't an attack, but a classic chain reaction triggered by “hidden assumptions + configuration chains” — permission changes exposed underlying tables, doubling the number of lines in the generated feature file. This exceeded FL2's memory preset, ultimately pushing the core proxy into panic.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
> Technical details: Even handling the unwrap correctly, an OOM would still occur. The primary issue was the lack of contract validation in feature ingest. The configuration system requires “bad → reject, keep last-known-good” logic.
> Why did it persist so long? The global kill switch was inadequate, preventing rapid circuit-breaking. Early suspicion of an attack also caused delays.
> Why not roll back software versions or restart?
> Rollback isn't feasible because this isn't a code issue—it's a continuously propagating bad configuration. Without version control or a kill switch, restarting would only cause all nodes to load the bad config faster and accelerate crashes.
> Why not roll back the configuration?
> Configuration lacks versioning and functions more like a continuously updated feed. As long as the ClickHouse pipeline remains active, manually rolling back would result in new corrupted files being regenerated within minutes, overwriting any fixes.
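To be fair, the one concrete idea buried in that thread, namely validate each incoming feature file against an explicit contract, reject anything that fails, and keep serving the last-known-good version, is a real and fairly simple pattern. Here's a rough sketch of what it could look like; purely illustrative, with names and limits I made up rather than anything from Cloudflare's actual code:

```rust
// Purely illustrative sketch: a config consumer that validates each incoming
// feature file against an explicit contract and keeps the last-known-good
// version instead of panicking on bad input. All names and limits are invented.
struct FeatureConfig {
    features: Vec<String>,
}

struct ConfigHolder {
    current: FeatureConfig, // last-known-good config, always valid
    max_features: usize,    // hypothetical contract limit on feature count
}

impl ConfigHolder {
    fn try_update(&mut self, raw: &str) -> Result<(), String> {
        let features: Vec<String> = raw.lines().map(|l| l.to_string()).collect();

        // Contract validation: reject files that violate the agreed limits
        // instead of letting them blow past a downstream memory preset.
        if features.is_empty() {
            return Err("empty feature file".into());
        }
        if features.len() > self.max_features {
            return Err(format!(
                "feature file has {} entries, limit is {}",
                features.len(),
                self.max_features
            ));
        }

        // Only replace the active config once the new one has passed validation.
        self.current = FeatureConfig { features };
        Ok(())
    }
}

fn main() {
    let mut holder = ConfigHolder {
        current: FeatureConfig {
            features: vec!["baseline".to_string()],
        },
        max_features: 200,
    };

    // An oversized file is rejected; `current` stays at the last-known-good state.
    if let Err(e) = holder.try_update(&"feature\n".repeat(500)) {
        eprintln!("rejected feature file, keeping last-known-good: {e}");
    }
    assert_eq!(holder.current.features.len(), 1);
}
```

The whole point is that the active config is never overwritten until the replacement has passed validation, so a bad feed can stall updates but not take the proxy down.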
This tweet thread evokes genuine despair in me. Do we really have to outsource even our tweets to LLMs? Really? I mean, I get spambots and the like tweeting mass-produced slop. But what compels a former engineer of the company in question to offer LLM-generated "insight" on the outage? Why? For what purpose?
* For clarity, I am aware that the original tweets are written in Chinese, and they have the stench of LLM writing all over them in the original too; it's not just an artifact of the translation provided in the comment above.
This particular excerpt reeks of it in pretty much every line. I'll point out the patterns in the English translation, but all of these patterns apply across languages.
"Classic/typical "x + y"", particularly when diagnosing an issue. This one is a really easy tell because humans, on aggregate, do not use quotation marks like this. There is absolutely no reason to quote these words here, and yet LLMs will do a combined quoted "x + y" where a human would simply write something natural like "hidden assumptions and configuration chains" without extraneous quotes.
> The configuration system requires “bad → reject, keep last-known-good” logic.
Another pattern of overeager quote usage is this terse, quoted "x → y, z" construct.
> This wasn't an attack, but a classic chain reaction
LLMs aggressively use "Not X, but Y". This is also a construct commonly used by humans, of course, but aside from often being paired with an em-dash, another tell is whether it actually contributes anything to the sentence. "Not X, but Y" is strongly contrasting and can add dramatic flair to the thing being contrasted, but LLMs overuse it on things that really don't need to be dramatised or contrasted.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
Two lists of three concepts back-to-back. LLMs enjoy, love, and adore this construct.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
This kind of completely vapid, feel-good word soup utilising a heroic analogy for something relatively mundane is another tell.
And more broadly speaking, there's a sort of verbosity and emptiness of actual meaning that permeates most LLM writing. This reads nothing like an engineer actually breaking down an outage. Take the aforementioned line: "Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms." What is that actually communicating to you? It piles on technical lingo and high-level concepts in a way that is grammatically correct but contains no useful information for the reader.
Bad writing exists, of course. There's plenty of bad writing out there on the internet, some of it suffers from flaws like these even when written by a human, and some humans do like their em-dashes. But it's generally pretty obvious when the writing is taken in aggregate and you see recognisable pattern after pattern, combined with em-dashes, combined with shallowness of meaning, combined with unnecessary overdramatisation.
https://x.com/guanlandai/status/1990967570011468071