> The title is pretty misleading. They're not even running Postgres, but AWS Aurora, which is Postgres compatible, but is not Postgres.
For what it's worth, every command ran works on normal Postgres. Hence we didn't think it mattered to mention Aurora specifically in the title.
> Also, pausing queries does count as downtime.
If a query takes a bit longer to respond, I don't think that counts as downtime. From the perspective of the user, they couldn't distinguish this migration event from some blip of slightly slower queries.
> If a query takes a bit longer to respond, I don't think that counts as downtime. From the perspective of the user, they couldn't distinguish this migration event from some blip of slightly slower queries.
It comes down to defining Service Level Objectives (SLOs) that are meaningful to your users. For one system I worked on, latency was important, and so one SLO was "99.999% of <a certain class of> requests with a deadline >=1s should succeed with latency <1s", so if this affected more than 0.0001% of requests in <time interval defined in our SLO>, we'd have called it an outage. But I've also worked on systems with looser SLOs where this would have been fine.
Not only that but I think you also need to take upstream systems into account. With a reasonably robust frontend that handles transient issues and retries reasonably, I think it's ok to say "no downtime"
Nice job, then! Technical downtime that’s virtually undetectable to users is a big win. In fact, “less than 5 seconds of downtime” in the title would actually make me want to read the article more as I tend to be suspicious of “zero downtime” claims for database upgrades, whereas <5s is clearly almost as good as zero and actually quantified :)
On the other than "less than 5 seconds of downtime" might give the impression that new queries sent within that time period would be rejected, while zero implies this doesn't happen, i.e. that it's undistinguishable from normal operation for the client.
And being even more precise in the title would just make it less titley :).
Yeah - a quantifiable amount in the headline would change the likelihood of the article being taken seriously - it goes from "No downtime? I call BS" to "Less than 5 seconds, that seems reasonable, and worth investigating"
AWS Aurora Postgres is a forked Postgres with a different storage engine. Sure you are technically correct, but there are many things called "Postgres compatible" that are very much less Postgres that AWS Aurora Postgres (like for example CockroachDB).
Iirc AWS explicitly calls out they still use upstream Postgres query engine and some other parts. It very much _is_ Postgres but not 100% pure upstream Postgres.
Thank you for pointing this out. I updated the essay to mention how long the pause took explicitly:
After about a 3.5 second pause [^13], the failover function completed smoothly! We had a new Postgres instance serving requests, and best of all, nobody noticed.
[^13]: About 2.5 seconds to let active queries complete, and about 1 second for the replica to catch up
Also, pausing queries does count as downtime. The system was unavailable for that period of time.