To expand on why this made me think of the Google outage:
It was a global backbone isolation, caused by configuration changes (as they all are...). It was detected fairly early on, but recovery was difficult because internal tools / debugging workflows were also impacted, and even after the problem was identified, it still took time to back out the change.
"But wait, a global backbone isolation? Google wasn't totally down," you might say. That's because Google has two (primary) backbones (B2 and B4), and only B4 was isolated, so traffic spilled over onto B2 (which has much less capacity), causing heavy congestion.
Google also had a runaway automation outage where a process went around the world "selling" all the frontend machines back to the global resource pool. Nobody was alerted until something like 95% of global frontends had disappeared.
This was an important lesson for SREs inside and outside Google because it shows the danger of the antipattern of command-line flags that narrow the scope of an operation instead of expanding it. That is, if your command was supposed to be `drain -cell xx` to locally turn down a small resource pool, but `drain` without any arguments drains the whole universe, you have developed a tool that is too dangerous to exist.
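A minimal sketch of the safer convention, using a made-up `drain` CLI and a placeholder `drain_cell()` helper (neither comes from the article): the scope is a required argument, so forgetting it fails loudly instead of defaulting to "everything."

```python
# Hypothetical sketch: the default scope is "nothing", and widening it
# has to be spelled out. Not a real internal tool.
import argparse
import sys


def drain_cell(cell: str) -> None:
    # Placeholder for the real turn-down logic.
    print(f"draining {cell}")


def main() -> int:
    parser = argparse.ArgumentParser(description="Drain resource cells.")
    # Positional and mandatory: running `drain` with no arguments is a
    # usage error, not "drain the whole universe".
    parser.add_argument("cells", nargs="+", help="cells to drain")
    args = parser.parse_args()

    for cell in args.cells:
        drain_cell(cell)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

With `nargs="+"` the bare invocation exits with a usage error, which is exactly the failure mode you want from a tool like this.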
As FB opines at the end, at some point it's a trade-off between power (being able to access / do everything quickly) and safety (having speed bumps that slow larger operations down).
The pure takeaway is probably that it's important to design systems where "large" operations are rarely required, and frequent ops actions are all "small."
Because otherwise, you're asking for an impossible process (quick and protected).
SREs live in a dangerous world, unfortunately. It's entirely possible the "tool" in question is a shell script that gets fed a list of bad cells, but some bug causes it to get a list of all the cells instead.
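Purely as an illustration of that bug class (the cell names and filter are invented, not from the incident): a filter that quietly degrades to "match everything" when its pattern comes back empty turns a targeted list of bad cells into a list of every cell.

```python
# Illustrative only: hypothetical cell names and a hypothetical filter.
# The bug: an empty pattern is a substring of every string, so the
# filter silently selects every cell instead of none.
ALL_CELLS = ["aa", "ab", "ba", "xx", "yy"]


def select_cells(pattern: str) -> list[str]:
    return [c for c in ALL_CELLS if pattern in c]


bad_pattern = ""                  # e.g. an upstream lookup returned nothing
print(select_cells(bad_pattern))  # ['aa', 'ab', 'ba', 'xx', 'yy'] -- everything


def select_cells_safe(pattern: str) -> list[str]:
    # Safer variant: refuse to silently expand an empty pattern.
    if not pattern:
        raise ValueError("refusing to expand an empty pattern to all cells")
    return [c for c in ALL_CELLS if pattern in c]
```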
Some tools are well engineered, capable of the Sisyphean task of globally deploying updates; others are rapid prototypes that, sure, are too dangerous to exist. But the whole point of SREs being capable programmers is that the work has problems that are most efficiently solved with one-off code that just isn't (because it can't be) rigorously tested before being used. You can bet there was some of that used in recovering from this incident. (I'm sure there were many eyes reviewing the code before it was run, but that only goes so far when you're trying to do something you never expected, like having to revive Facebook.)
The other problem is scale: the standard "save me" for tools like this is a `--doit` and a `--no-really-i-mean-it` and defaulting to a "this is what I would've done" mode. That falls apart the moment the list of actions is longer than the screen, and you were already expecting it to be: after all, how can you really tell the difference unless the console scrolls for a really long time?
There are solutions to that, but of course these sorts of tools all come into existence well before the system reaches a size where how they work becomes dangerous.
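One such guardrail, sketched here with invented names and an arbitrary threshold: default to a dry run, and instead of relying on a human to eyeball a scrolling list, summarize the blast radius and refuse once the action count crosses a limit.

```python
# Hypothetical sketch: dry-run by default, plus a blast-radius limit so
# "too many actions" is caught by the tool rather than by watching the
# console scroll. The names and threshold are made up for illustration.
import sys

MAX_ACTIONS = 50  # arbitrary illustrative safety limit


def plan_actions(targets: list[str]) -> list[str]:
    return [f"drain {t}" for t in targets]


def run(targets: list[str], doit: bool = False) -> None:
    actions = plan_actions(targets)

    # Summarize the blast radius up front instead of printing a wall of text.
    print(f"planned actions: {len(actions)}")

    if len(actions) > MAX_ACTIONS:
        sys.exit(f"refusing: {len(actions)} actions exceeds the "
                 f"{MAX_ACTIONS}-action safety limit")

    if not doit:
        print("dry run only; nothing executed")
        return

    for action in actions:
        print(f"executing: {action}")  # placeholder for the real work
```

The key difference is that the guard keys off a count the tool computes, not off a human noticing how long the output scrolled.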
I feel like this explains so much about why the gcloud command works the way it does. Sometimes feels overly complicated for minor things, but given this logic, I get it.
> a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network
"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication.
...
Our services are now back online and we’re actively working to fully return them to regular operations. We want to make clear that there was no malicious activity behind this outage — its root cause was a faulty configuration change on our end."
Ultimately, that faulty command changed router configuration globally.
The Google outage was triggered by a configuration change due to an automation system gone rogue. But hey, it too was triggered by a human issuing a command at some point.
I'm inclined to believe the later post as they've had more time to assess the details. I think the point of the earlier post is really to say "we weren't hacked!" but they didn't want to use exactly that language.
This is kind of like Chernobyl, where they were testing how hot they could run the reactor to see how much power it could generate. Then things went sideways.
The Chernobyl test was not a test to drive the reactor to its limits, but actually a test to verify that the inertia of the main turbines was large enough to drive the coolant pumps for X amount of time in the case of a grid failure.
As already said, the test was about something entirely different. And the dangerous part was not the test itself, but the way they delayed the test and then continued to perform it despite the reactor being in a problematic state and the night shift, who were not trained on this test, being on duty. The main problem was that they ran the reactor at reduced power long enough to have significant xenon poisoning, and then pushed the reactor to the brink when they tried to actually run the test under these unsafe conditions.
I'd say the failure at Chernobyl was that anyone who asked questions got sent to a labor camp and the people making the decisions really had no clue about the work being done. Everything else just stems from that. The safest reactor in the world would blow up under the same leadership.
At first I thought it was inappropriate hyperbole to compare Facebook to Chernobyl, but then I realized that I think Facebook (along with Twitter and other "web 2.0" graduates) has spread toxic waste across a far larger area than Chernobyl did. But I would still say that it's not the _outage_ which is comparable to Chernobyl, but the steady-state operations.