Hacker News
Cold restart whole system after total outage (evalapply.org)
132 points by dmazin on July 20, 2023 | 105 comments



The full telephone system, which the author starts with, may not be restartable. Sure, you could restart the SS7 databases and computers, but the control plane runs over the data plane, which is configured via...the control plane. Originally the network controls were literally operators (humans), but bit by bit parts were incrementally automated, pulling the system slowly (over decades) by its bootstraps, which were gradually decommissioned as they weren't needed any more.

I have a friend who knows a lot about the phone system (he has a security clearance for some of his telephone work). One time we had a long conversation about this topic, until at one point he said "and let's talk about something else" -- I guess from that point some of the details are classified. So maybe there is a plan, or maybe they just designed the system in such a way that they could convince themselves that it would not go down unless things were so severe that loss of the phone system would not be your chief worry.

---

In September 2001 there was a full stand-down of US airspace. That was accomplished pretty quickly: "you are ordered to land immediately at the closest airport that can handle your aircraft, or be shot down". Undoing that, however, took some careful planning! Fortunately the stand-down lasted several days so there was time to work it out. Even if you had a plan for this (and I assume the FAA had one), figuring out what the realities on the ground were and matching them up with the plan was nontrivial.

Apparently some of the planes landed where they could not take off again unless they were emptied and carried only a small amount of fuel, enough to get to an airport designed for them. I don't believe I heard that any planes landed where they could never leave.


> So maybe there is a plan, or maybe they just designed the system in such a way that they could convince themselves that it would not go down unless things were so severe that loss of the phone system would not be your chief worry.

My belief from working in very large companies, and (previously) on mission-critical systems, is that a clean bootstrap and recovery process is extremely unlikely, almost impossible. Because in complex systems full of legacy parts, built by people who have long since retired, the stars won't align.

The only way to truly know is to design and periodically test for disaster scenarios (emphasis on the plural). But due to the scale in time and space, cost and bureaucracy, this planning and rehearsing is not going to happen with the desired detail and intensity. People do not seriously plan for things that have never happened.

If it does happen, there will be a small group of extremely capable people that will find a way to bootstrap the system. It won't be according to some previously laid out plans -- they will make the plan in real time. They're not famous and probably never will be.


Eh, would not surprise me in the slightest if there was a secret billion dollar program that specifically practiced disaster recovery for the phone network. The government doesn't have the same motivations as a business and they spend a lot of money in a lot of places just to be prepared for unlikely events. Like we spend billions on military hardware we don't foresee ever needing _to keep the engineering capacity to design and build military hardware_.


>Like we spend billions on military hardware we don't foresee ever needing _to keep the engineering capacity to design and build military hardware_.

Well, more because the military industrial complex lines the pockets of your politicians, who in turn decide how to spend the budget.


The two are not mutually exclusive. In fact they reinforce each other.


In Ireland, we just build a runway to allow the plane to take off again.

https://www.rte.ie/archives/2018/0521/965058-mexican-lands-p...


Clear evidence that news was more entertaining in the 80s:

    In the merry month of May
    Just before the dawn of day
    A plane flew in for Shannon to refuel.
    Because Shannon is fogged out,
    There as of right they ought
    To touch down in Cork Airport as a rule.
    
    As he flew towards Mallow town 
    His supply of fuel was down
    But the pilot was as cool as cool could be.
    In a racetrack west of town
    He made a safe touch down
    Just beside the Mallow sugar factory.


Hunt the hare and turn her down the rocky road... Wait, what?


It appears the town is still celebrating the event: https://www.independent.ie/regionals/cork/news/family-of-mex... Publication date is 2023-03-22, so 40th anniversary.


And in the meantime they had the pilot judge a beauty contest. What a story!


There's the story of TACA 110, a Boeing 737-300 which landed on a grass levee, and took off from an auto-traffic boulevard (albeit one built on a prior NASA airfield runway).

<https://en.wikipedia.org/wiki/TACA_Flight_110>


I've long had a fascination with aviation incidents (the Gimli Glider happened on my birthday), but I hadn't heard of this one before!

What a great story. Thank you for posting it.



this is truly one of the takeoffs of all time


If you dig deep enough into the SS7 stuff running in a modern regional ILEC, it's way more fragile than you might think. Mostly because it's no longer treated as an absolutely-cannot-fail thing that is also a primary source of revenue, like back in the days when everyone had a POTS line and tons of money came in from long distance bills. Many operators are decommissioning stuff like 5ESS and Nortel switches and moving to modern soft switches as quickly as they can.

The network stuff underpinning a lot of critical TDM phone traffic these days is like a collection of 23-year-old Cisco 15454s held together by spare parts and a few people who care about them.


I've been out of the industry just long enough to remember that Cerent 454's got rebranded as Cisco 15454's right near the end of my tenure...

Yow. Way to make a guy feel old! :P


Mostly I was using the 15454 as an example, there's lots of other 20 to 30 year old stuff out there in the TDM transport sector that's only available on ebay, through weird used equipment dealers, or by finding a decom from another ISP/telco.. Stuff like T1 (or DS0!) to DS3 mux/demux to attach to a 15454, or similar. There's literally 911 call center transport circuits being held together by the telco equivalent of duct tape and string right now, nobody notices until it breaks.

One of the weird challenges in building a new state-of-the-art intercity DWDM transport network now is dealing with things like legacy customers that have one OC48 and are unlikely to drop it any time soon (it's a considerable monthly revenue source), and having to stuff that into the system alongside 100Gbps-and-greater coherent circuits.

Also from a customer relations perspective, sometimes the customer literally forgets that they have this extremely expensive DS3 or OC48 or something in monthly recurring billing, and you don't want to bring it to the attention of management, because they might go "are we still using this?" and cancel it.


> There's literally 911 call center transport circuits being held together by the telco equivalent of duct tape and string right now, nobody notices until it breaks.

And break it does.

https://www.kron4.com/news/bay-area/911-dispatch-system-in-o...


I contracted for a telco who held a national emergency number contract, and after they fired all the old engineers and the inevitable outages started happening, had to budget a significant amount per month for flowers for families of the deceased who weren’t able to reach anybody during heart attacks etc.

It would have been better if those had been really big SLA breach fines instead, however.


The 2001 shutdown is even more crazy when you consider that for the FAA administrator who ordered it, it was his first day on the job. Hell of a first day.


Hell of a day to quit sniffing glue.


> I don't believe I heard that any planes landed where they could never leave.

This has happened before outside of 2001, albeit not really a DR issue but more a political issue -- if we look to the Meigs Field destruction by former Chicago mayor Richard Daley, multiple aircraft were left stranded with a now destroyed runway. (The solution was to just give them special clearance to take off on a taxiway, but still.)

https://web.archive.org/web/20110720045652/http://www.aopa.o...


OP here. Thanks for the remarks! That tale is apocryphal to me. I found it amusing, and telling in the sense that disasters are the last thing people think of, and for large enough systems (especially ones that have accreted over human generations) the organisational knowledge has probably been lost to retirement and death. And if you're lucky, maybe a scrap of it has survived. Then one's job is to do the requisite software archaeology to figure out what one's people might have entirely forgotten.

Also, if that script remained valid at the time, I doubt it would do any critical actions. It might have been a sort of literal script to follow --- run the script, see what it says, do a thing, run the script, and so forth. Its supposed job was to help humans solve a bootstrap problem.

I see how my wording of that passage makes it sound like the be-all end-all of cold booting a telco. But that's what we get when we wall-of-text in our Slacks :D

(edit: clarifying remarks)


You know you can actually bring fuel to a plane and refuel it wherever it is? And that's probably the answer to the first question: it can be done, but you have to have someone who knows what he's doing think outside the box a little.


It's 1995 or so. I'm at U.F. getting my CS degree. Our facilities guy is giving a tour of the department's server room to some big wigs.

For some reason, he decides to demo the UPS cut-over switch. I have no idea why. But he manages to toggle it the wrong way and instead of switching the entire room full of servers to the UPS, he manages to cut power to _all of them_.

My recollection is that the cooling went out too and the room was suddenly very silent. But in retrospect that doesn't make sense.

What I do remember is that it was non-trivial to bring all our Unix servers back up, because over the years they had been set up with NFS mounts in a loop such that for A to boot, it needed B to be up, which needed C to be up, which needed A to be up.

Oops.

So it took a lot of manual intervention to bring everything back up.
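
(For the curious: a toy check for that kind of mount-dependency loop. The hosts and dependency map below are hypothetical; in real life you'd scrape them from each machine's /etc/fstab.)

    # Toy dependency-cycle check for boot-time NFS mounts.
    # The dependency map here is made up for illustration.
    deps = {"A": ["B"], "B": ["C"], "C": ["A"]}

    def find_cycle(deps):
        def visit(node, path):
            if node in path:                      # been here before: loop
                return path[path.index(node):] + [node]
            for dep in deps.get(node, []):
                cycle = visit(dep, path + [node])
                if cycle:
                    return cycle
            return None
        for node in deps:
            cycle = visit(node, [])
            if cycle:
                return cycle
        return None

    print(find_cycle(deps))  # ['A', 'B', 'C', 'A'] -> no valid boot order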


I was in that room! Poor fellow. It was the demo of our new UPS; I can still remember the gradual fading fans and clicks of capacitors.

He was really nervous showing the new install off to us, and he was just talking through the positions of the switch; but he actually moved the switch to each of them as he did it.

Back in the days the conslops ran the asylum. :)

Any old school UF types looking at this, The Dog list lives on, and is again meeting from time to time. :) But at quieter venues. Our ears have unaccountably gotten old.


The heck is a conslops besides the food at the prison?


Consultant / operator.

We were student employees, who were nonetheless in charge of all the systems. It was an absolutely awesome experience.


At least as valuable as my actual CS classes.


Console Operators


s/capacitors/relays/

Sorry, couldn't hold myself :)


Vaguely related, in 2015 we were building a new platform based on Cisco UCS and NetApp filers.

Cisco had a virtual router, N1000 perhaps; the hardware was a C220 and it had some sort of appliance running on it. Which depended on some sort of LUN from NetApp for its storage. But you couldn't stand the LUN up until vCenter was up, because UCS provisioned the LUN (it didn't work if you did it manually) and that ran on vCenter, and vCenter depended on the router so that it could reach the LUN. It was pure circular-logic hell.

There was a path to make this work, but it literally took half a dozen very ~~clever~~ expensive Cisco and NetApp engineers a couple of days in a room with a whiteboard to figure it out. It was absurd.


I've got two such stories. Around 2000, I was working at a huge web hosting company (third shift, monitoring the network). Suddenly the power went out. Turns out, the building management (separate company) decided to run a UPS test and the memo got lost. Fun time that night.

Second story, around 2005---I, along with a friend and my father, were in Las Vegas eating lunch at one of the major casinos when the power went out completely. It was eerily silent and dark! (and then slowly, we started hearing the groans of slotzombies rising among us) I'm sure someone lost their job for a UPS cut-over failure.


We had a blackout (seemed to happen every fall when the students arrived - the university had a power station), the NFS server needed NIS to boot, and NIS server required NFS. We managed to manually get the NIS service running in single user mode, brought up everything else, and then rebooted NIS.


I remember a day when a disgruntled employee shut off an entire Exodus data center by slapping the EPO switch. Took a long time and a lot of work to get everything in our cage back up.


In the anecdote about Bill and the DISASTER script, I'm not so sure that deleting the script would be such a big deal. If this script hasn't been touched since the 1980s and nobody knows what it does, presumably nobody has tested it recently.

It seems like if there really was a disaster, first of all nobody would know that script existed, and second of all if they tried to run the script, it would fail because of all the changes to the system since the script was initially developed.

Isn't there some saying like "if you don't test your backups, you don't have backups" or something like that?


right? that whole bit reads like some lame parable. like who in their right mind is going to run a 10k line shell script named DISASTER they've never read and cannot read because it's 10k lines of shell? there is apparently no documentation (and positively no tests)? one guy close to retirement remembers what it's for and says "don't delete this critical bit of code!"

it’s just utter bullshit.


If tens of millions of dollars are on the line, you will be able to find someone who can run the script or derive enough knowledge from it

In a disaster scenario, something is better than nothing


Bill is clearly still picking up the phone. He'd likely be amenable to picking up a paycheck as well.


Not if he's dead. Some of the stuff is old enough that it's a reasonable question.


it’s more that i don’t believe it’s a real scenario.


Isn't that the entire point? People don't believe (or choose to not believe) a certain disaster scenario is valid, until it happens. We all have seen first-hand the many examples of colossal failures of disaster-response planning in our recent planet-wide emergency. As have we seen the creative, dogged, herculean efforts to cope with it.


OP here... As I wrote here, it is better to think of the story as apocryphal: https://news.ycombinator.com/item?id=36798893

Also of course it will be crazy to read a giant shell script. But then again if the stakes are high enough, and if it yields even one critical piece of information, then it's worth it.

The larger point is that organisational knowledge clings on in strange ways. In a crazy disaster scenario, people may appreciate having access to anything they can get their hands on.


10K lines is not _that_ large. If it was written sensibly, it might be very useful. Bill might have written a text document, but chose to use Shell as a preferred engineers' language. Who doesn't share a one-liner with a colleague in need? Bill shared a 10k-liner :-). And Shell being a relatively high-level language, it probably packs the information more densely than a text file.


True. And let me go a step further to argue that someone so forward-looking as to (allegedly) make a script for an all-caps DISASTER scenario would be intelligent about the contents of said script. They would have cared deeply for their fellow line staff of the then-unknown future.


"Nobody cares if you have backups. Everyone cares that you can restore."

Classic problem of deferred costs. Backups cost money and it's tempting to avoid investment in them (ie fail to test them) but that can bite you when its least convenient.


I bet it still would have been a useful template for a human to read to get a general idea of what things to do and in what order.


good luck reading 10k lines of shell written decades ago. it would likely be an incredible waste of time.


> good luck reading 10k lines of shell written decades ago. it would likely be an incredible waste of time.

If the entire telephone system was down and needed to be cold-started, and the script had information someone needed to do that, someone would take the time to read it. Maybe not run it, but definitely read it to extract clues.

I mean, it's not like it's binary. It's totally possible.


Based on ChatGPT, assuming 10k lines translates to around 30k words, it should take about 3hrs to read it. Multiply that by your favorite factor for read-and-understand. Split that among a few people, skim parts that are not applicable, etc. All in all, this seems easily readable in a sensible amount of time.
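
The back-of-the-envelope math, with my assumed (made-up) rates:

    lines = 10_000
    words = lines * 3            # assume ~3 words per line of shell
    reading_wpm = 200            # assume a casual reading pace
    grok_factor = 3              # read-and-understand multiplier
    print(words / reading_wpm * grok_factor / 60, "hours")  # 7.5 hours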


I know, right?

Can you even imagine the alternative? "Hey lets throw this maybe incredibly helpful shell script in the trash. Because it's too long."


It's likely that enough infrastructure had changed that the script would not run successfully anyway. On the other hand it would have been excellent documentation about the system at a certain point in time.

I say excellent, not because I assume it would have been well written, but because when it comes to these things I prefer code (any form of code, indeed) that I can be reasonably sure worked at some time over natural language documentation any time.

I completely agree with Dr. House when he says he never asks the patient, because they lie all the time. Same with human written documentation.


I have a hard time believing the 10k line shell script didn't have a comment at the top saying what it did.


#!/bin/sh


Interestingly this paragraph isn't quite true:

> So much of the modern world depends on our mastery over materials (to make a precision screw, you need a precision-machined harder material—diamond / titanium—to work on a softer material—steel), and our ability to turn rotary motion to linear motion (it's stupidly difficult to reliably precision-machine a harder material without even more precise linear + rotary motion—lathe/CNC machine). Hence, a bootstrap problem.

Steel is hardenable (or rather, some steels are hardenable), you can change its hardness through the specific application of heating and cooling. So you can make a crude tool with relatively soft steel, harden it, and use it to make a more precise steel tool (again machine soft, then harden). This does make the bootstrapping problem a bit easier, I think. Although not easy in the absolute.

See https://www.youtube.com/watch?v=V_Mp1fNzIT8 for a great dive into primitive steel hardening techniques.


There's a way to grind mirrors for optics with polishing stones that aren't even flat to the naked eye. Basically the system arrives at tiny tolerances via the process of using the system.

And there's a way to make three perfectly flat sharpening stones by starting with three raw pieces of natural sharpening stone, just by alternately rubbing the three stones together until they flatten each other out.

Paul Sellers can teach you how to flatten a large board without a planer. He also has videos on how to get a wood plane perfectly flat using a large sharpening stone (which can be made as above or with float glass).

And if memory serves, to make something perfectly round you first need something perfectly flat. Once you have something perfectly flat and something perfectly round it's off to the races.

Edit: "The Origins of Precision" is a half hour well spent https://www.youtube.com/watch?v=gNRnrn5DE58


Do you have links to the other videos about the other things you said besides the creation of something perfectly flat? I saw that covered in the Origins of Precision but was left wanting on the lens making and board flattening. It seemed to be more historical and less practical, from the perspective of bootstrapping things.


I think the same channel covered mirror grinding but I'm very fuzzy on that. Paul Sellers is all over Youtube. He's practically the elder statesman of hand tool woodworking. The video I'm thinking of, he's planing a board that is too big to go through a hobbyist's planer, so that one always made the most sense to me. Rob Cosman's might be easier to find, like https://www.youtube.com/watch?v=GGuGFGAQTxE

Flat boards require a flat plane, and like chisels, the tolerances on a new plane are fairly loose. Partly down to thermal contraction (from running the production line too fast? I've never gotten a straight answer). So the first thing you do with both is grind them truly flat, and you need a reference surface for that, like float glass or a diamond stone. Common protocol is to use the diamond stone only to flatten sharpening stones, the sharpening stones to flatten chisels, and the chisels to flatten mortises. Basically diamond stones are very accurate but too expensive to have sufficient grit ratings and longevity.

This is not the one I'm thinking of, but it's a taste:

https://www.youtube.com/watch?v=Cl5Srx-Ru_U

Right angles can be achieved by a process of iterative refinement. A square is two flat surfaces that are used to adjust two other flat surfaces, and they are only at right angles when 90.0º + 90.0º = 180.0º. So if you reverse the square or make two identical squares, they should touch along their entire length. If they don't then they're not square. Alternatively you can apply a square multiple times and check if the 1st and 3rd plane are perfectly parallel. Or if the 1st and 4th plane intersect at the same point, which also increases your accuracy by 4x by multiplying the error. I've seen this demoed by fine woodworkers squaring up a table saw for instance.


David Gingery's books might be of interest to anyone thinking of bootstrapping a metal working shop starting from charcoal and scrap aluminum.

https://www.gingerybookstore.com/


OP here. Thanks for the critique! Yes I agree fully. The specific example of diamond/titanium aside, the general point stays, I feel. A youtube rabbit hole is nigh, clearly :)


I think that John Plant[1] and some friends could get us from the stone age to ironworking. He's shown how to start with stones and get that far, albeit on a very small scale.

Going from iron to precision screws is a matter of first making precision flat surfaces, then lathes, and onward from there.[2] You can do that with just iron and heat treating, but it won't be easy.

If you want an alternate history where something slightly less drastic is dealt with, the book "Ring of Fire - 1632" by the late Eric Flint[3] is an interesting place to start. In the book, a town from West Virginia circa 2000 is thrown back into the middle of the Thirty Years' War in Germany. Much of the book's exposition is about the supply chains we all depend on, and how they work. It's the start of an awesome series.

Books and working knowledge are a precious resource. As long as we have a critical mass of them, and conditions remain reasonably tolerable for human life, we can recover.

[1] https://www.youtube.com/channel/UCAL3JXZSzSm8AlZyD3nQdBA

[2] https://ia800104.us.archive.org/20/items/FoundationsOfMechan...

[3] http://www.baen.com/chapters/0671578499/0671578499.htm


Beware the 1632 series. You'll think that you're just picking up a fun "Connecticut Yankee in King Arthur's Court" adventure yarn, but then a year down the line you've read a full dozen, the library just knows to go ahead and order the next one for you once you pick one up, and you're wondering if you'll be able to finish the series before retirement.

https://en.wikipedia.org/wiki/List_of_books_in_the_1632_seri...


At this point, you’re likely to be about to finish the series, because it’s unlikely to get much longer. Eric Flint’s Wikipedia entry is now past tense. He died last year.


Could be worse. I found Weber's stuff as sticky once upon a time, and Eric Flint's a considerably more skillful writer. But I appreciate the warning all the same; I really don't need so weighty an obsession on top of all my other hobbies.


It's been a few years, but I used to run DR exercises for corporates. Cold start means your only possessions are the fireproof suitcase full of LTO-5s and the street address of the DR data center. One day to bootstrap essential infra services; by the end of the 2nd day you'd have most customer-facing systems up; day 3 would be the non-essential stuff. Personally I'd do it without sleep, but most of the youngsters would need a break. Pretty exhilarating, as IT work goes. Always use the feature that generates multiple index tapes of what backup set is on what numbered tape :-)


OP here. That would be quite the trip! Finding oneself on the front line of something like that would easily bring out the best---and the worst---in a person. A rare chance to form lifelong collegial bonds, and perhaps to exorcise a personal demon or two (fear, anger, egotism ...).


The only reason I was brought in to do the exercise as a consultant was because the guy supposed to be doing it was too frightened, delayed for almost a year, went on long-term sick leave, and eventually was fired/quit. The docs and ops procedures actually looked good, so I took on the challenge. In the end, the only gap was one flexvol that had been configured manually years before and was missed by the DR automation. Not a bad result.


I had the misfortune of having to do this twice for real.

I quit the storage / backup industry because 7:30am phone calls would make me hyperventilate in a panic.


> Another colleague in the chat remarked up-thread (apropos cold reboot thinking):

> I have seen this at <Indian eCommerce Giant> and at <a FAANG>. Most of it is related to cached data. Cold starts with empty caches causes too much load on databases. And then the failures cascade.

> — Another M'colleague in the Slackroom.

Isn't that not really a problem with cold restart per se, but more the restart procedure? If caches are so critical, wouldn't you need a feature to throttle the load to what the databases can handle, as the caches populate? E.g., if you're cold-rebooting Facebook, start by blocking all connections except those geolocated to North Dakota, then add other regions as your caches fill.
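
A minimal sketch of what such a throttle could look like, with made-up names and thresholds:

    # Sketch: admit a growing slice of users while caches warm up.
    import zlib

    admit_fraction = 0.01  # start with 1% of users

    def admitted(user_id: str) -> bool:
        # Stable hash, so the same users stay admitted as the gate
        # widens and the caches warm for a consistent population.
        return zlib.crc32(user_id.encode()) % 10_000 < admit_fraction * 10_000

    def on_metrics_tick(cache_hit_rate: float):
        global admit_fraction
        if cache_hit_rate > 0.8:                           # caches look healthy
            admit_fraction = min(1.0, admit_fraction * 2)  # widen the gate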


Specifically you want load shedding in the servers and retry-with-backoff in the clients. Clients should do their best to exponentially back off on retrying failed requests, only try to contact healthy servers, and maintain internal rate-limiting based on error rate. Servers should do their best to reply with failure quickly and cheaply, to drive good clients' backoff/rate-limiting, and just drop bad clients' traffic. Service discovery should try to detect and spread load across healthy servers (though this isn't always available at first in a cold start, because the metrics or metadata probably aren't available yet), but in the end it's up to the servers to reliably drop traffic they can't handle instead of building up giant queues and slowing to a crawl. Middleware is the hardest, because it has to be a good client and also fail fast on overload as a server, by correctly interpreting upstream behavior. Deadlines in RPCs that get passed across system boundaries can work pretty well for tall stacks of system layers where service health discovery or dependency discovery is hard, but they require careful configuration to avoid failures or very slow starts under heavy load.
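
A minimal sketch of the two halves (all numbers invented):

    import random, time

    MAX_QUEUE = 100  # server: past this, fail fast instead of queueing

    def handle_request(queue_depth: int) -> int:
        # Server half: shed load cheaply rather than build a giant queue.
        if queue_depth > MAX_QUEUE:
            return 503   # immediate, cheap rejection
        return 200       # ...do the real work here...

    def call_with_backoff(send, max_attempts: int = 6) -> bool:
        # Client half: exponential backoff with full jitter between retries.
        for attempt in range(max_attempts):
            if send() == 200:
                return True
            time.sleep(random.uniform(0, min(60, 2 ** attempt)))
        return False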


That’s spot on. I work at a large cloud provider and one of our larger eCommerce customers had an outage in a kubernetes cluster which handled the front end traffic routed through a large CDN provider. Well sure enough “just turn it back on” wasn’t an option since the surge of traffic was too rapid for the services and the cluster to scale out. They ended up having to turn the traffic back on incrementally to let things scale up to the point where they could handle the load.


One of the earliest incidents I worked on in the late 90s involved students DDOSing a university webserver in anticipation of exam results being posted. The server load was so high we had to pull the physical plugs on the server. :/


I'm reminded of Bryan Cantrill's talk "Debugging Under Fire"[1], which includes a retrospective of sorts about an entire datacenter rebooting.[2] That is a pretty large-scale disaster, but even that is a rung below a continent-wide outage. Poor "Bill" must have seen the proverbial light when he heard some folks wanted to trash the DISASTER script.

[1]: https://www.youtube.com/watch?v=30jNsCVLpAE

[2]: https://www.tritondatacenter.com/blog/postmortem-for-outage-...


At a certain point, you aren't doing a cold restart, but a high speed recreation of the system based on prioritized needs.


It seems you've been there too :)


The author appears to have missed cyclic dependencies as a barrier to cold restarts. In large systems, you could have built critical component A with no dependencies, and later on component B is built that depends on A being up. "A" gets redesigned and now depends on "B" being up (which is fine since B is already up). The knowledge of how to build (the original) A from scratch withers as old-timers shuffle off the organization's payroll (or mortal coil), and suddenly, you have a system that's impossible to bootstrap from 0.


OP here. A good point! I suppose we could say that modern paper currency was system B, built on top of bullion / system A ("dependency-free" as in gold exists in the Earth's crust independent of humans). And at some point currency was liberated from those earthly bindings, and became free-floating.

So then the question would be how to bring back free-floating currency, if we somehow shot ourselves back to the dark ages and forgot all about it.

"Lots of luck" would be my best guess. (Because if we manage to go back there, we are likely to succeed at going back further and forgetting even more. These sorts of swings have multi-generational momentum I suppose.)

> The author appears to have missed cyclic dependencies as a barrier to cold restarts.

P.S. I am sure I have missed plenty in that stream-of-consciousness post (and mistaken a bunch too)!


Heh, Microsoft AD + DNS + VMs is a common 'cold start' trap for the inexperienced.

There was a story around this during the Iraq War where a US military virtual machine system went down, and had to come back up without internet. The problem was VMware needed DNS to start the VMs, one of the VMs it needed was Active Directory for security, AD hosts the DNS, and now you're locked up without an external running system.

DNS itself is typically a cold start nightmare.


I had an old 486dx-50 as a backup domain controller for just that reason.


This is called 'blackstart', particularly in the energy sector.

My earliest exposure to this concept as a child was watching the film Jurassic Park. As someone fascinated by systems I found the idea of having to bring the whole system back up from scratch pretty interesting.

Today I still find these kinds of bootstrapping processes fascinating - both these megascale processes, but also the boot process that occurs whenever you turn on your computer. The latter is probably one of the most Rube Goldbergian feats of engineering with us today that actually still achieves a useful purpose. In fact it's absurd how Rube Goldbergian it is. And the complexity of the boot processes for modern systems (see [1] for a small glimpse) is extraordinary.

When you turn on your computer, it's like you're re-executing the entire process of a civilization bringing itself into being, gradually developing progressively more sophisticated technologies: at first RAM isn't working, but then you get RAM working and that lets you get progressively more sophisticated parts of the hardware working, etc.

By comparison, humans have no "automatic boot process". We're constructed in the 'on' state via fork(). So this repetition of entire process of, ah, 'abiogenesis' whenever you turn on your computer is kind of insane by comparison. Entire kingdoms of hardware state rise and fall with the press of a power button.

As an aside, I'm fond of the Red Dwarf novels, which are set on a massive mothership-type spaceship. The ship was constructed in space and never designed to enter a planet's atmosphere. In particular, the ship, and its engines, was constructed in the 'on' state by the crew that built it originally. It was never conceived that the ship would ever need to be rebooted, because it was assumed once the engines were initially fired during the commissioning of the ship, they would never be turned off until decommissioning. Thus, the ship has no automatic boot process for the engines, only a manual engine firing procedure which is extraordinarily arduous and long-winded and takes weeks to execute, said procedure having been included in the manual only as a curiosity more than anything else. This idea of a ship built "on" under the assumption it would never once be shut down or "restarted" until decommissioning is interesting, but of course also directly mirrors biological life.

[1] https://www.devever.net/~hl/backstage-cast


"Under the words 'Contact Position' there's a button that says 'Push to Close'".

"Push it."

When Spielberg was on he was ON! Who else could take a scene like "they have to reset the circuit breakers" and have you absolutely on the edge of your seat over it?


Out of curiosity I actually looked up the details on this scene a while back. There is some actual real model of circuit breaker this scene was based on; I think someone dug up the model number, but I can't remember it now.

It turns out a lot of high-power circuit breakers need the energy from a clockwork mechanism in order to open in the event of a fault. So the thing where you have to pump the handle is actually to charge a clockwork mechanism to ensure there's enough mechanical energy to open in the event of a fault - AIUI. Presumably, the breaker is designed with safety in mind and won't let you push-to-close until you've done this.


Neat! I just looked it up now, it's a Westinghouse SPB-100 (https://jurassicpark.fandom.com/wiki/Circuit_Breakers).

I always thought that the main breaker looked much more real than the "individual park systems". Turns out it was! The other "breakers" with the backlit names are definitely from the prop department


> We're constructed in the 'on' state via fork().

Maybe some worms are, but "we" most certainly aren't. The closest analogy might be execve("/proc/self/exe"), but even that is flawed.


In some ways a disaster recovery plan can follow a better path than the original bootstrap. Imagine industrial society. On page two of the disaster recovery plan you have a description of germ theory. Something that has saved hundreds of millions of lives and would have saved hundreds of millions more had it been discovered / formalized thousands of years previously. People knew about sanitation in prehistory but the theory wasn’t formalized so doctors didn’t always wash their hands.

Likewise knowledge around nitrogen fixation and fertilizers.

There are probably a half a dozen huge improvements that could be made for bootstrap_society_v2.sh

Perhaps we should all write down our version and tuck it away somewhere safe just in case. Maybe on something more durable than paper. And certainly more durable than electronic storage.


One of my stories from working as a vendor at $large_bank is that, per some folks there, they weren't able to do disaster recovery testing of their largest Oracle database for years, even though per procedure they should've done it every 6 months.


For telco at least, the problem boils down to graceful random backoff at all interconnection points. Without that, you die miserably. So, for example, you can boot up your class 5 softswitch, but it will be unhappy, because no sooner do you do that than the access network equipment starts spamming it with connection attempts (and, equally, it starts trying to connect trunks to the rest of the PSTN). So you need things like congestion control, randomized backoff, etc. Telco switches can do this.

An amusing aside is that IP telephony sometimes gets into a mess here, as SIP sends a 503 with a Retry-After prompt to the client, but it's not always randomized, so you can get these waves of barbarians at the gate, who all go away, and then all come back again exactly 180 seconds later...
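
A tiny sketch of the fix, jittering the server's hint client-side (the window here is made up):

    import random

    def next_retry_delay(retry_after_s: float) -> float:
        # Spread retries across a window around the server's hint,
        # e.g. 180s becomes anywhere in [90s, 270s), so rejected
        # clients don't all return in the same wave.
        return retry_after_s * random.uniform(0.5, 1.5)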


IMO if you can't cold start it you probably can't develop against it very quickly.

Then again we couldn't cold start a supply chain or a semi fab or humanity itself so maybe that's the default.


My comfort level with an architecture is always vastly improved by being able to run a toy version of the entire system on a developer box.

It doesn't just speak well for disaster recovery prospects (both the feasibility of doing it and the density of developers who could possibly pull such a thing off), it's also very, very useful for speculative development.

When you make a high barrier to entry of making large modifications to the system, you also tend to create an underclass of developers, who never really get to understand how the system works.

What if we split these two microservices into three, or combined these three into two? That's a pretty common question that only gets asked if you know you won't get laughed out of the room for suggesting it.


You may enjoy the first episode of James Burke's Connections ("The Trigger Effect"), if you've not seen it.

https://www.youtube.com/watch?v=NcOb3Dilzjc


I enjoyed the one in the Witness but hadn't gotten around to the rest, thanks for the excellent recommendation! https://archive.org/details/james-burke-connections_s01e10


Still the best non-fiction TV ever produced in my opinion (the whole series I mean).


The Day the Universe Changed (also by Burke) was also exceptionally good.

Similar premise, but focusing more on philosophy and understanding, and not tracking linearly through time.

There were another two Connections mini-series, though they were IMO not up to the mark of the first. Good, but not truly excellent.


Every new semi fab that comes online was cold-started.


With the output of the prior generations was my point.


Great read: One Good Turn: A Natural History... https://www.amazon.com/dp/0684867303?ref=ppx_pop_mob_ap_shar...

A history of the screw. Really interesting around how it was developed. Some machining techniques are far older than you’d expect, and some capabilities far newer.

The book's thesis is that the screw is the most important invention.


I love this explanation for why we should make plans even though folks will try to shoot down planning with "no plan survives first contact":

> Even though nothing will go as planned, it's important to have the memory and expertise that did the planning, because that's what's going to be able to think through the as-yet- unknown-unknowns, when the inevitable FUBAR situation suddenly happens later.

We don't plan so that everything will go according to plan, we plan so that we are better equipped to reason when the plan doesn't work.


This is very interesting in the context of power infrastructure as well. As we found out the hard way during the 2003 power blackout in North America.


Practical Engineering did a video on the complexity of bringing a power grid back online, called "black start" (not cold start).

https://practical.engineering/blog/2022/12/5/what-is-a-black...


Black start of a single plant seems reasonable enough, but black start of a whole grid seems almost absurd.

How could one possibly balance the load with the plants coming online? If the generation and load are too mismatched, the generators can literally automatically trip off the grid, so generation and load must be carefully balanced as things get brought back up.

One would almost need to shed nearly all the loads from the black grid (which may have happened anyway as the grid collapsed, but any loads not already shed by the collapse could prove interesting), and re-add some gradually as plants come online, which still seems crazy difficult.

And inrush current demands from many loads as they get reconnected must be pretty insane.
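
A toy sequencing sketch of what I mean, ignoring inrush and frequency dynamics entirely (all numbers invented):

    online_gen_mw = 50                  # one plant up after black start
    margin = 0.05                       # keep load 5% below generation
    pending_loads_mw = [10, 5, 20, 40, 8, 30]
    connected_mw = 0

    for step in range(20):
        online_gen_mw += 25             # another unit synchronizes
        for load in list(pending_loads_mw):
            # Reconnect a block only if it fits under the margin.
            if connected_mw + load <= online_gen_mw * (1 - margin):
                connected_mw += load
                pending_loads_mw.remove(load)
        if not pending_loads_mw:
            print(f"fully restored at step {step}: {connected_mw} MW")
            break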


Offhand, that's pretty much the plan for a black start.

Something about bringing up a designated plant and feeding the output over 'cranking' lines that other plants along the route can synchronize their output against. Then gradually adding load and source until the system is meta-stable again.

Edit: additional data

Not only synchronize, but also use for internal needs like all the pumps, and particularly the 'excitation current' that establishes the magnetic field for the generator. It allows control over the output voltage. There are also other drawbacks to the more obvious solution of fixed magnets, which can be oversimplified as 'wear'.

https://en.wikipedia.org/wiki/Permanent_magnet_synchronous_g...

https://en.wikipedia.org/wiki/Excitation_(magnetic)

https://en.wikipedia.org/wiki/Electric_generator


The problem is not so much the concept, as how tricky it would be to add the loads back in at just the right rate to not trip some or all the generation back off again. Sure once you have enough generation and load already online, adding the rest is relatively straightforward. Still need to be careful, but after a certain point it would look to utility operators much like the usual work restoring loads and sources after a large area blackout.

The trickiness seems worst close to the very beginning when even relatively small misestimation of a chunk of load being restored would have a proportionally bigger impact. Many loads are not completely predictable, so presumably they would need to favor bringing some of the more stable loads online early so that normal variation from the loads that can only be predicted well in aggregate won't vary enough to trip everything back offline.


Or you can chaos monkey style shut off the continent on a regular basis.


Well, as of the 2020s, we know we can shut down immense swaths of human activity in multiple continents virtually overnight, and that when we do so, the environment rapidly cleans itself up.


So not one mention of terraform/pulumi?



