
> I think in particular because it is hard to pick and choose availability

I mean, human error is behind something like 95% of all outages: intentional changes to the running system and how it operates. You can definitely choose when you make those. Hardware or system failures outside of that are exceedingly rare, though I grant they happen; and simple systems are easier to stand back up if they do fall over.

Stopping a database to take a backup is a prime example of something that's significantly harder to do when you're running it 24/7, 365 days a year.

Similarly, performing a host migration, or running a potentially dangerous major version upgrade.

What we do normally is limit the cost of rolling back, which is nice.

But instead of "iterate quickly and roll back if it causes an outage," it can easily be: one person stays a little late to upgrade the server, and rolls back if it doesn't work -- impacting a much more limited set of people.

Also: if you work globally, you can run upgrades on your edges.

Also also: 20% unavailability is just an extreme case. I have 99.997% uptime with a single host machine that lives on a shelf in my room. I'm not saying this is how you should run systems, but it's pretty normal for even single nodes to have insanely high availability out of the box. Bonus: because it's so simple, if it does have an outage a restore takes 23 minutes.
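For a sense of what a figure like 99.997% means in wall-clock terms, the arithmetic is simple (this is just the standard uptime-budget calculation, not anything from my setup specifically):

```python
# Convert an uptime percentage into an annual downtime budget.
minutes_per_year = 365 * 24 * 60          # 525,600 minutes in a non-leap year
uptime = 0.99997

downtime_budget = minutes_per_year * (1 - uptime)
print(f"{downtime_budget:.1f} minutes/year")   # roughly 15.8 minutes of downtime per year
```

So "three and a half nines" on a single box is a downtime budget of about a quarter of an hour per year.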

I know this because I run restores on another machine pretty often, and when I take a backup I usually do a blue/green swap between these boxes.
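The blue/green routine between two boxes boils down to: restore the latest backup onto the idle box, check it, then promote it to serve traffic. A toy sketch of that role swap (the `Box` type and `restore_and_swap` helper are illustrative names, not a real tool):

```python
from dataclasses import dataclass

@dataclass
class Box:
    name: str
    live: bool
    backup_restored: bool = False

def restore_and_swap(live: Box, idle: Box) -> tuple[Box, Box]:
    """Restore onto the idle box; if that succeeds, swap roles."""
    idle.backup_restored = True        # stand-in for the real restore + sanity checks
    if idle.backup_restored:
        live.live, idle.live = False, True
    return idle, live                  # (new live box, new idle box)

blue, green = Box("blue", live=True), Box("green", live=False)
new_live, new_idle = restore_and_swap(blue, green)
# green now serves traffic; blue becomes the restore target next time
```

The nice property is that every backup gets exercised as a restore, so you're never guessing whether your backups actually work.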


