
Did you test the "stop the world" approach? I wonder how the write downtime compares. The 1 second of replication lag seems unavoidable, but the arbitrary 2.5 seconds of waiting for txns to finish could be removed by just killing all running txns, which your new approach already does for txns longer than 2.5 seconds.

> ;; 2. Give existing transactions 2.5 seconds to complete.

> (Thread/sleep 2500)

> ;; Cancel the rest

> (sql/cancel-in-progress sql/default-statement-tracker)

Then you have 2.5 seconds less downtime, and I think you can avoid the problem of holding all connections on one big machine.
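
For concreteness, the "stop the world" variant I'm picturing would look something like the sketch below, reusing the sql/cancel-in-progress call you quoted. The pause/promote/resume hooks are placeholder names I'm making up for whatever gates writes on your side, not your actual API:

    ;; Sketch only, not the post's real code. `pause-writes!`,
    ;; `promote-replica!` and `resume-writes!` are hypothetical hooks;
    ;; sql/cancel-in-progress is the call quoted above.
    (defn stop-the-world-switch!
      [{:keys [pause-writes! promote-replica! resume-writes!]}]
      ;; 1. Stop admitting new transactions.
      (pause-writes!)
      ;; 2. Kill everything in flight immediately -- no 2.5s grace period.
      (sql/cancel-in-progress sql/default-statement-tracker)
      ;; 3. Wait for the replica to catch up (the ~1s of lag) and promote it.
      (promote-replica!)
      ;; 4. Point connections at the new primary and resume writes.
      (resume-writes!))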

> Our switching algorithm hinges on being able to control all active connections. If you have tons of machines, how could you control all active connections?

> Well, since our throughput was still modest, we could temporarily scale our sync servers down to just one giant machine

> In December we were able to scale down to one big machine. We’re approaching the limits to one big machine today. [15] We’re going to try to evolve this into a kind of two-phase-commit, where each machine reports their stage, and a coordinator progresses when all machines hit the same stage.
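
For what it's worth, the coordination you're describing sounds like a stage barrier. My rough reading of it (a sketch under my own assumptions, not your implementation) would be:

    ;; Each machine reports its stage into some shared store (just an atom
    ;; here for illustration); the coordinator blocks until every machine
    ;; has reached the stage it's waiting on.
    (defn report-stage! [stages machine-id stage]
      (swap! stages assoc machine-id stage))

    (defn await-stage
      "Blocks until every machine in machine-ids has reported stage."
      [stages machine-ids stage]
      (while (not (every? #(= stage (get @stages %)) machine-ids))
        (Thread/sleep 50)))

so the coordinator would call something like (await-stage stages all-machines :writes-paused) before letting anyone move on to the next step.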

I guess it depends on what your SLO is. With your approach, only clients whose txns started before the upgrade and run longer than 2.5 seconds see them fail, whereas with the "stop the world" approach there would be a period, lower-bounded by the replication lag, during which all txns fail.

Cool work, thanks for sharing!

Edit: A relevant question for the SLO that I'm not considering is how txns make their way from your customers to your DB. Do your customers make requests to your API, with your application servers sending the txns to your Postgres instance? If so, you could set up a reasonable retry policy in your application code, use the "stop the world" approach, and once your DB is available again the retries would succeed. Then your customers never see any txns fail (even the long-running ones), just a slight increase in latency. If you're worried about retrying in cases unrelated to this upgrade, you could change the retry policy's configuration shortly before/after the upgrade, or return an error code specific to this scenario so your retry code knows.
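
A rough sketch of the kind of retry wrapper I mean (all names here are made up for illustration, not your API):

    ;; Hypothetical application-side retry around submitting a txn.
    ;; `retryable?` should match only the failures expected during the
    ;; switch (connection reset, statement cancelled), so unrelated
    ;; errors still surface immediately.
    (defn with-upgrade-retry
      [submit-txn! {:keys [max-retries delay-ms retryable?]
                    :or   {max-retries 5 delay-ms 500 retryable? (constantly true)}}]
      (loop [attempt 0]
        (let [result (try
                       {:ok (submit-txn!)}
                       (catch Exception e
                         (if (and (< attempt max-retries) (retryable? e))
                           ::retry
                           (throw e))))]
          (if (= ::retry result)
            (do (Thread/sleep delay-ms)
                (recur (inc attempt)))
            (:ok result)))))

You'd wrap each txn submission in it, e.g. (with-upgrade-retry #(run-txn! conn txn) {:max-retries 10 :delay-ms 300}), and loosen or tighten the policy around the upgrade window.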

Then you get the best of both worlds: no downtime perceivable to customers, no waiting for 2.5 seconds, and you don't have to write a two-phase-commit approach for it to scale.

If your customers send txns to your Postgres instance directly, this wouldn't work, I think.


