To expand on why this made me think of the Google outage:
It was a global backbone isolation, caused by configuration changes (as they all are...). It was detected fairly early on, but recovery was difficult because internal tools / debugging workflows were also impacted, and even after the problem was identified, it still took time to back out the change.
"But wait, a global backbone isolation? Google wasn't totally down," you might say. That's because Google has two (primary) backbones (B2 and B4), and only B4 was isolated, so traffic spilled over onto B2 (which has much less capacity), causing heavy congestion.
Google also had a runaway automation outage where a process went around the world "selling" all the frontend machines back to the global resource pool. Nobody was alerted until something like 95% of global frontends had disappeared.
This was an important lesson for SREs inside and outside Google because it shows the danger of the antipattern of command-line flags that narrow the scope of an operation instead of expanding it. That is, if your command was supposed to be `drain -cell xx` to locally turn down a small resource pool, but `drain` without any arguments drains the whole universe, you have developed a tool that is too dangerous to exist.
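A minimal sketch of the safer convention, using a made-up `drain` CLI and a placeholder `drain_cell()` helper (neither comes from the article): the scope is a required argument, so forgetting it fails loudly instead of defaulting to "everything."

```python
# Hypothetical sketch: the default scope is "nothing", and widening it
# has to be spelled out. Not a real internal tool.
import argparse
import sys


def drain_cell(cell: str) -> None:
    # Placeholder for the real turn-down logic.
    print(f"draining {cell}")


def main() -> int:
    parser = argparse.ArgumentParser(description="Drain resource cells.")
    # Positional and mandatory: running `drain` with no arguments is a
    # usage error, not "drain the whole universe".
    parser.add_argument("cells", nargs="+", help="cells to drain")
    args = parser.parse_args()

    for cell in args.cells:
        drain_cell(cell)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

With `nargs="+"` the bare invocation exits with a usage error, which is exactly the failure mode you want from a tool like this.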
As FB opines at the end, at some point it's a trade-off between power (being able to access / do everything quickly) and safety (having speed bumps that slow larger operations down).
The pure takeaway is probably that it's important to design systems where "large" operations are rarely required, and frequent ops actions are all "small."
Because otherwise, you're asking for an impossible process (quick and protected).
SREs live in a dangerous world, unfortunately. It's entirely possible the "tool" in question is a shell script that gets fed a list of bad cells, but some bug causes it to get a list of all the cells instead.
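Purely as an illustration of that bug class (the cell names and filter are invented, not from the incident): a filter that quietly degrades to "match everything" when its pattern comes back empty turns a targeted list of bad cells into a list of every cell.

```python
# Illustrative only: hypothetical cell names and a hypothetical filter.
# The bug: an empty pattern is a substring of every string, so the
# filter silently selects every cell instead of none.
ALL_CELLS = ["aa", "ab", "ba", "xx", "yy"]


def select_cells(pattern: str) -> list[str]:
    return [c for c in ALL_CELLS if pattern in c]


bad_pattern = ""                  # e.g. an upstream lookup returned nothing
print(select_cells(bad_pattern))  # ['aa', 'ab', 'ba', 'xx', 'yy'] -- everything


def select_cells_safe(pattern: str) -> list[str]:
    # Safer variant: refuse to silently expand an empty pattern.
    if not pattern:
        raise ValueError("refusing to expand an empty pattern to all cells")
    return [c for c in ALL_CELLS if pattern in c]
```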
Some tools are well engineered, capable of the Sisyphean task of globally deploying updates; others are rapid prototypes that, sure, are too dangerous to exist. But the whole point of SREs being capable programmers is that the work has problems that are most efficiently solved with one-off code that just isn't (because it can't be) rigorously tested before being used. You can bet there was some of that used in recovering from this incident. (I'm sure there were many eyes reviewing the code before it was run, but that only goes so far when you're trying to do something you never expected, like having to revive Facebook.)
The other problem is scale: the standard "save me" for tools like this is a `--doit` and a `--no-really-i-mean-it` and defaulting to a "this is what I would've done" mode. That falls apart the moment the list of actions is longer than the screen, and you were already expecting it to be: after all, how can you really tell the difference unless the console scrolls for a really long time?
There are solutions to that, but of course these sorts of tools all come into existence well before the system reaches a size where how they work becomes dangerous.
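One such guardrail, sketched here with invented names and an arbitrary threshold: default to a dry run, and instead of relying on a human to eyeball a scrolling list, summarize the blast radius and refuse once the action count crosses a limit.

```python
# Hypothetical sketch: dry-run by default, plus a blast-radius limit so
# "too many actions" is caught by the tool rather than by watching the
# console scroll. The names and threshold are made up for illustration.
import sys

MAX_ACTIONS = 50  # arbitrary illustrative safety limit


def plan_actions(targets: list[str]) -> list[str]:
    return [f"drain {t}" for t in targets]


def run(targets: list[str], doit: bool = False) -> None:
    actions = plan_actions(targets)

    # Summarize the blast radius up front instead of printing a wall of text.
    print(f"planned actions: {len(actions)}")

    if len(actions) > MAX_ACTIONS:
        sys.exit(f"refusing: {len(actions)} actions exceeds the "
                 f"{MAX_ACTIONS}-action safety limit")

    if not doit:
        print("dry run only; nothing executed")
        return

    for action in actions:
        print(f"executing: {action}")  # placeholder for the real work
```

The key difference is that the guard keys off a count the tool computes, not off a human noticing how long the output scrolled.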
I feel like this explains so much about why the gcloud command works the way it does. Sometimes feels overly complicated for minor things, but given this logic, I get it.
> a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network
"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication.
...
Our services are now back online and we’re actively working to fully return them to regular operations. We want to make clear that there was no malicious activity behind this outage — its root cause was a faulty configuration change on our end."
Ultimately, that faulty command changed router configuration globally.
The Google outage was triggered by a configuration change due to an automation system gone rogue. But hey, it too was triggered by a human issuing a command at some point.
I'm inclined to believe the later post as they've had more time to assess the details. I think the point of the earlier post is really to say "we weren't hacked!" but they didn't want to use exactly that language.
This is kind of like Chernobyl, where they were testing how hot they could run the reactor to see how much power it could generate. Then things went sideways.
The Chernobyl test was not a test to drive the reactor to its limits, but actually a test to verify that the inertia of the main turbines was large enough to drive the coolant pumps for X amount of time in the case of a grid failure.
As already said, the test was about something entirely different. And the dangerous part was not the test itself, but the way they delayed the test and then continued to perform it despite the reactor being in a problematic state and the night shift, who were not trained on this test, being on duty. The main problem was that they ran the reactor at reduced power long enough to have significant xenon poisoning, and then pushed the reactor to the brink when they tried to actually run the test under these unsafe conditions.
I'd say the failure at Chernobyl was that anyone who asked questions got sent to a labor camp and the people making the decisions really had no clue about the work being done. Everything else just stems from that. The safest reactor in the world would blow up under the same leadership.
At first I thought it was inappropriate hyperbole to compare Facebook to Chernobyl, but then I realized that I think Facebook (along with Twitter and other "web 2.0" graduates) has spread toxic waste across a far larger area than Chernobyl did. But I would still say that it's not the _outage_ which is comparable to Chernobyl, but the steady-state operations.