This line of "what if an asteroid hits the primary & DR data centers in the same microsecond" thinking is why we settled on running our product on 1 VM with SQLite in-proc.
After taking our customers through this same kind of apocalyptic rabbit hole conversation, they tend to agree with this architecture decision.
The cost of anticipating the .00001% that might never come is completely drowned out by the massive, daily 99%-certain headache that is managing a convoluted, multi-cloud cluster.
Many times the business owners will get the message and finally reveal that they have always had access to a completely ridiculous workaround involving literal paper & pen that is just as feasible in 2023 as it was in the 18th century.
Or the dev who spent a few nights and weekends rescuing the system after one of those 1% failures that the customer, as it turns out, has no patience for at all.
Disaster recovery is just one of many things that are much simpler in non-distributed systems.
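To make that concrete, here is a rough sketch of what a backup could look like for a single-VM, in-process SQLite setup, using the online backup API exposed by Python's standard `sqlite3` module. The function name and file paths are made up for illustration, and this is not a description of anyone's actual production setup.

```python
import sqlite3


def backup_database(src_path: str, dest_path: str) -> None:
    """Copy a SQLite database file to dest_path via the online backup API.

    Both paths are hypothetical; point them at whatever files you use.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup (Python 3.7+) produces a consistent copy of the
        # source database, even if other connections are using it.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

Something like `backup_database("app.db", "app-backup.db")` run on a schedule, with the copy shipped off the machine, covers a lot of the recovery story. Restoring is then mostly a matter of putting the file back.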
You seem to be confusing a system that produces bad results 1% of the time with a system that's down 1% of the time. If you can only write the first kind of non-distributed system, you're in for a bad trip if you try to write a distributed equivalent.