There are no backups. There are no failovers. There is no git. There are no orchestration or deployment strategies. Programmers ssh into the server and edit code there. Years and years of patchwork on top of patchwork with tightly coupled code.
Such is a taste of what needs to be done if you wish to have a service that takes months to set back up after any disruption.
This is a perfect description of how things work at one of the largest health care networks in the northeast US (speaking as someone who works there and keeps saying "where's the automation? where are the procedures?" and keeps being told to shut up, we don't have TIME for that sort of thing).
lol, the healthcare industry was definitely on my mind as I wrote this. Never worked there, but I read a lot of postmortems and it shows whenever I use their digital products. A recent example is CVS.
Somehow, at some point, they decided that my CVS pharmacy account should be linked to my mom's ExtraCare. Couldn't find any menu to fix it online. So the next time I went to the register I asked to update it. They read the linked phone number. It was mine. Ok, it is fixed, I think. But then the receipt prints out and it is my mom's ExtraCare card number. So the next time I press harder. I ask them to read me the card number they have linked from their screen. They read my card number. Ok, it is fixed, I think. But then the receipt prints out and the card number is different: it is my mom's. Then I know the system is incredibly fucked. Being an engineer, I think about how this could happen. I'm guessing there are a hundred database fields where the ExtraCare number is stored, and only one is set to my mom's or something. I poke around the CVS website and find countless different portals made with clearly different frameworks and design practices. Then I know all of CVS's tech looks like this and a disaster is waiting to happen.
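If it helps make that guess concrete, here's a toy sketch (Python + SQLite, with table and column names I made up purely for illustration) of the failure mode I suspect: the same card number stored in more than one place, with the register screen and the receipt printer reading from different copies, so "fixing" one copy fixes nothing.

```python
import sqlite3

# Hypothetical sketch: the same loyalty-card number stored redundantly in
# two tables, with no single source of truth. All names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account_profile   (account_id INTEGER PRIMARY KEY, extracare_card TEXT);
    CREATE TABLE register_receipts (account_id INTEGER PRIMARY KEY, extracare_card TEXT);
    INSERT INTO account_profile   VALUES (1, 'MOMS-CARD-0001');
    INSERT INTO register_receipts VALUES (1, 'MOMS-CARD-0001');
""")

# The clerk "fixes" the card number, but only in the table their screen reads from.
conn.execute("UPDATE account_profile SET extracare_card = 'MY-CARD-0002' WHERE account_id = 1")
conn.commit()

screen  = conn.execute("SELECT extracare_card FROM account_profile   WHERE account_id = 1").fetchone()[0]
receipt = conn.execute("SELECT extracare_card FROM register_receipts WHERE account_id = 1").fetchone()[0]
print("screen shows: ", screen)   # MY-CARD-0002   -- looks fixed at the register
print("receipt prints:", receipt) # MOMS-CARD-0001 -- still wrong on the printout
```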
Goes like this for a lot of finance as well.
E.g. I can say with confidence that Equifax is still as scuffed as it was back in 2017 when it was hacked. That is a story for another time.
Nobody bothers to keep things clean until it is too late. The features you deliver earn promotions, not the potential catastrophes you prevent. Humans have a tendency to be short-sighted, chasing endless earnings beats without anticipating future problems.
Sorry if I phrased it poorly. I wasn't definitively saying that all of these things are the case. But what is always the case is that when an attack takes down an organization for months, it was employing a tremendous number of horrendous practices. My list was meant to give some examples.
M&S isn't down for months because of something innocuous like a full security audit. As a public company losing tens of millions of dollars a week, their only priority is to stop the bleeding, even if that means a hasty partial restoration. The fact that they can't even do that suggests they did things terribly wrong. There are countless things I didn't list that could also be the case, like if Amazon gave them proprietary blobs they lost after the attack and Amazon won't provide again. But no matter what the specifics are, things were wrong beyond belief. That is a given.
To be fair, I would bet that nearly every organization employs a tremendous number of horrendous practices. We only gasp at the ones that get taken down, for some reason.
Horrendous practices exist on a spectrum. Every org has bad code that somebody will fix someday™. It is reasonable to expect that after a catastrophic event like this, a full recovery takes some time. But at a "good" org, these practices are isolated. Not every org is entirely held together with masking tape. For the entire thing to be down for so long, the bad practices need to be widespread, seeping into every corner of the product. Ubiquitous.
For instance, when Cloudflare went down a while ago due to a bad regex, it took less than an hour to roll back the changes. Undoubtedly there were bad practices that let a single regex take everything out, but the problem was isolatable: once it was addressed, partial service was quickly restored, and preventative measures followed shortly after. That bug didn't destroy Cloudflare for months.
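To be clear, Cloudflare's actual rule was far more complicated than this, but the general class of bug is easy to reproduce: a regex with nested quantifiers that backtracks catastrophically on the wrong input and pins a CPU. A toy Python illustration (not their rule, just the same failure class):

```python
import re
import time

# Nested quantifiers like (a+)+ force the engine to try exponentially many
# ways to split the input once the overall match fails, so runtime roughly
# doubles with every extra character.
pattern = re.compile(r"^(a+)+$")

for n in range(16, 23):
    text = "a" * n + "!"              # the trailing "!" guarantees the match fails
    start = time.perf_counter()
    pattern.match(text)
    elapsed = time.perf_counter() - start
    print(f"n={n:2d}  {elapsed:.3f}s")
```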
P.S. In anticipation of the "but Cloudflare has SLAs!!" objection: that isn't really a distinction worth making, because M&S has an implicit SLA with their customers; they are losing 40 million each week they can't offer service. Plenty of non-B2B companies invest in quick recovery as well, like Netflix with its Chaos Monkey testing.
No, best practice is that you have a checklist for bringing up a copy of your system; better yet, that checklist is "run a script". In the cloud age you ought to be able to bring a copy up in a new zone with a repeatable procedure.
It makes a big difference in developer quality of life and improves productivity right away. When you onboard a new dev, you hand them the checklist and they are up and running that same day.
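For what it's worth, a minimal sketch of the "checklist is a script" idea might look like the following; every step name and command here is a placeholder, not a recommendation of any particular stack:

```python
import subprocess
import sys

# Hypothetical bring-up checklist as a script: an ordered list of steps that
# either all succeed or stop loudly at the first failure. The commands are
# placeholders standing in for whatever your environment actually needs.
BRING_UP_STEPS = [
    ("provision infrastructure", ["terraform", "apply", "-auto-approve"]),
    ("restore latest database backup", ["./restore_db.sh", "--latest"]),
    ("deploy application", ["./deploy.sh", "--env", "staging-copy"]),
    ("run smoke tests", ["./smoke_tests.sh"]),
]

def bring_up() -> None:
    for name, cmd in BRING_UP_STEPS:
        print(f"==> {name}: {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Fail fast so the checklist never silently half-completes.
            sys.exit(f"step failed: {name} (exit code {result.returncode})")
    print("environment is up")

if __name__ == "__main__":
    bring_up()
```

The point isn't the specific tools; it's that the procedure is written down, executable, and exercised often enough (new devs, new zones, restores) that it still works the day you actually need it.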
I had a coworker who taught me a lot about sysadmining, (social) networking, and vendor management. She told me that you'd better have your backup procedures tested. One time we were doing a software upgrade and I screwed up and dropped the Oracle database for a production system. She had a mirror in place so we had less than a minute of downtime.