Moreover, no manager gets dinged for "internet-wide" outages, unfortunately, so the compliance department keeps calling the shots. The number of times I've had to explain that adding an "antivirus" to our Linux servers adds no security, since we already have proper monitoring at the eBPF level, is annoying.
I'd be fired if I caused enough loss in revenue to pay my own salary for a year.
I am responsible for my choices. I'm CTO. I don't doubt that in some cases execs cover for each other, but at least I have anecdotal experience of what it would take for me to be fired, and this is clearly communicated to me.
Hope you get paid a lot! Otherwise you are in either a very young or a very stupid job.
I regularly spend multiples of my salary every month on various commitments my company makes. Any small mistake could easily turn into a multiples-of-my-salary type of problem within 10 days.
A friend of mine spent half a million on a storage device that we never used. It sat in the IT area for years until we were acquired. Everyone gave him so much shit. Finance asked me about it numerous times (going around my friend the CTO) so they could properly depreciate it. He didn't get dinged by the board at all. It remained an open secret. We were making million-dollar decisions once a month, though.
> I regularly spend multiples of my salary every month on various commitments my company makes.
Yeah, same here.
But if I choose a vendor and that vendor fails us so catastrophically as to make us financially insolvent, then it's my job to have run a risk analysis and to have an answer for why.
If it's more cost effective to take an outage, that's fine. If it's not, then why didn't I have a DRP in place? Why did we rely so much on one vendor? What's the exposure?
It's a pretty important part of being a serious business person.
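To make the "is it more cost effective to take the outage" question concrete, here is a minimal sketch of that comparison. Everything in it is an assumption for illustration: the `OutageRisk` fields, the `mitigation_pays_off` helper, and all the numbers are made up, and a real risk analysis would also have to cover things like contractual penalties and reputational damage.

```python
from dataclasses import dataclass


@dataclass
class OutageRisk:
    """Hypothetical inputs for a vendor-outage estimate (all values are assumptions)."""
    outages_per_year: float       # expected number of outages per year
    hours_per_outage: float       # expected duration of one outage, in hours
    revenue_loss_per_hour: float  # revenue lost per hour of downtime


def expected_annual_loss(risk: OutageRisk) -> float:
    """Expected yearly revenue loss: frequency x duration x hourly loss."""
    return risk.outages_per_year * risk.hours_per_outage * risk.revenue_loss_per_hour


def mitigation_pays_off(risk: OutageRisk,
                        mitigation_cost_per_year: float,
                        loss_reduction: float = 1.0) -> bool:
    """True if the loss avoided by the mitigation exceeds its yearly cost.

    loss_reduction is the fraction of the expected loss the mitigation
    (e.g. a DRP or a second vendor) is assumed to remove, between 0 and 1.
    """
    avoided = expected_annual_loss(risk) * loss_reduction
    return avoided > mitigation_cost_per_year


# Made-up example: one 8-hour outage every two years at 20k lost per hour.
risk = OutageRisk(outages_per_year=0.5, hours_per_outage=8.0,
                  revenue_loss_per_hour=20_000.0)
print(expected_annual_loss(risk))                                    # 80000.0
print(mitigation_pays_off(risk, mitigation_cost_per_year=250_000.0))  # False
print(mitigation_pays_off(risk, mitigation_cost_per_year=50_000.0))   # True
```

With these invented numbers, a 250k/year failover setup costs more than the outage it prevents, so taking the outage is the defensible answer; at 50k/year it is not.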
Sure, but that's not what I said or you said, and my commentary was about relative measures of your salary to your budget.
If you can't make a mistake the size of your salary in your budget, then your budget is small or very tight. Most corporations fuck up big multiples of their CTO's salary quarterly (but that turns out to be single-digit percentage points of anything useful).
> I'd be fired if I caused enough loss in revenue to pay my own salary for a year.
I'm not so sure.
I know of a major company that had a glitch, multiple times, that caused them to lose about 15 million dollars at least once (a non-prod test hit prod because of a poorly designed tool).
I was told the decision-makers decided not to fix the problem (the risk of losing more money again) because the "money had already been lost."
> no manager gets dinged for "internet-wide" outages
Kind of like, nobody gets fired for hiring IBM, or using SAP. They are just so big, every manager can say, "look how many people are using them, how was I supposed to know they are crap".
But it seems like for uptime, someone should be identifiable. If your job is uptime, and there is a worldwide outage, I'd think it would roll downhill onto someone.
> Kind of like, nobody gets fired for hiring IBM, or using SAP. They are just so big, every manager can say, "look how many people are using them, how was I supposed to know they are crap".
I wouldn't necessarily say IBM or SAP are "crap". It's much more likely that orgs buying into IBM or SAP don't do the due diligence on the true costs of properly setting it up and keeping it running, and therefore cut tons of corners.
They basically want to own a Ferrari, and when it comes to maintenance, they want to run regular gas and try to get their local mechanic to slap Ford parts on it because it's too expensive to keep going back to the dealership.
The thing is usually this argument goes something like this:
A: Should prod be running a failover / <insert other safety mechanism>?
B: Yes!
A: This is how much it costs: <number>
B: Errm... Let me check... OK I got an answer, let's document how we'd do it, but we can't afford the overhead of an auto-failover setup.
And so there will be two types of companies. The ones that "do it properly" will have higher costs and lower margins, and over time they'll be less successful as long as no big incident happens. When a big incident does happen, though, recent history shows that for most businesses, if everyone was down, nobody really complains. If your customers have 1 vendor down due to this issue, they will complain, but if your customers have 10 vendors down, and are themselves down, they don't complain anymore. And so you get this tragedy-of-the-commons type of dynamic where it pays off to do what most people do rather than the right thing.
And the thing is, in practice, doing the thing most people do is probably not a bad yardstick, however disappointing that is. 20 years ago nobody had 2FA and that was acceptable; today most sites do, and it's no longer acceptable not to have it.
Parents may teach this to kids but the kids usually notice their parents don't practice what they preach. So they don't either.
The world is filled with people following everybody else off a cliff. If you're warning people or even just not playing along in a time of great hysteria, people at best ignore your warnings and direct verbal abuse at you. At worst, you can face active persecution for being right when the crowd has gone insane. So most people are cowards who go along to get along.