The title is pretty misleading. They're not even running Postgres, but AWS Auror...

stopachka · 2025-01-29T21:04:21 1738184661

> The title is pretty misleading. They're not even running Postgres, but AWS Aurora, which is Postgres compatible, but is not Postgres.

For what it's worth, every command we ran works on normal Postgres. Hence we didn't think it mattered to mention Aurora specifically in the title.

> Also, pausing queries does count as downtime.

If a query takes a bit longer to respond, I don't think that counts as downtime. From the perspective of the user, they couldn't distinguish this migration event from some blip of slightly slower queries.

scottlamb · 2025-01-29T22:23:20 1738189400

> If a query takes a bit longer to respond, I don't think that counts as downtime. From the perspective of the user, they couldn't distinguish this migration event from some blip of slightly slower queries.

It comes down to defining Service Level Objectives (SLOs) that are meaningful to your users. For one system I worked on, latency was important, and so one SLO was "99.999% of <a certain class of> requests with a deadline >=1s should succeed with latency <1s". If this had affected more than 0.001% of requests in <time interval defined in our SLO>, we'd have called it an outage. But I've also worked on systems with looser SLOs where this would have been fine.
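To make the arithmetic concrete, here is a minimal sketch of that kind of check. The function name, the request counts, and the choice to count a slow request the same as a failed one are assumptions for illustration, not details of any real system. A 99.999% target leaves an error budget of 0.001%, i.e. 10 requests per million.

```python
from fractions import Fraction


def violates_slo(total_requests: int, bad_requests: int,
                 target: str = "0.99999") -> bool:
    """Return True if bad requests exceed the SLO's error budget.

    `bad_requests` counts requests that failed or missed the latency
    bound (hypothetical definition). `target` is a decimal string so the
    budget is computed exactly rather than with floating point.
    """
    if total_requests == 0:
        return False  # no traffic in the window, nothing to violate
    budget = total_requests * (1 - Fraction(target))
    return bad_requests > budget


# 1,000,000 requests at 99.999% allow exactly 10 bad requests.
within_budget = violates_slo(1_000_000, 10)
over_budget = violates_slo(1_000_000, 11)
```

Under a looser target such as `"0.999"`, the same window would tolerate 1,000 slow requests, which is why a pause of a few seconds can be an outage for one service and a non-event for another.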

nijave · 2025-01-29T22:31:19 1738189879

Not only that but I think you also need to take upstream systems into account. With a reasonably robust frontend that handles transient issues and retries reasonably, I think it's ok to say "no downtime"
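As a rough illustration of "handles transient issues and retries reasonably", a client-side wrapper might look like the sketch below. `TransientError`, `call_with_retries`, and the delay values are hypothetical names and numbers chosen for the example; a real client would map its driver's timeout and connection errors onto something similar.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, dropped connections)."""


def call_with_retries(operation, max_attempts=4, base_delay=0.2, sleep=time.sleep):
    """Run `operation`, retrying on TransientError with exponential backoff.

    The delay before retry k is drawn uniformly from
    [0, base_delay * 2**(k-1)] (full jitter), so many clients hitting the
    same pause do not all retry at the same instant.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

With these defaults the total wait is at most 0.2 + 0.4 + 0.8 = 1.4 seconds, so the numbers would need tuning upward to ride out a pause of several seconds.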

RadiozRadioz · 2025-01-29T21:41:31 1738186891

Completely depends on what the "user" is. Are they a human, or a machine that explicitly requires timings within a particular threshold?

lionkor · 2025-01-29T21:16:29 1738185389

It depends if it feels like an outage

awesome_dude · 2025-01-29T21:30:02 1738186202

> If a query takes a bit longer to respond, I don't think that counts as downtime

"We're sorry that your query took 7 hours to be responded to, but it wasn't an outage - honest"

stopachka · 2025-01-29T21:49:29 1738187369

We would count 7 hours as downtime too. Our pause was less than 5 seconds.

libraryofbabel · 2025-01-29T22:16:14 1738188974

Nice job, then! Technical downtime that’s virtually undetectable to users is a big win. In fact, “less than 5 seconds of downtime” in the title would actually make me want to read the article more as I tend to be suspicious of “zero downtime” claims for database upgrades, whereas <5s is clearly almost as good as zero and actually quantified :)

_flux · 2025-01-30T09:58:16 1738231096

On the other hand, "less than 5 seconds of downtime" might give the impression that new queries sent within that time period would be rejected, while zero implies this doesn't happen, i.e. that it's indistinguishable from normal operation for the client.

And being even more precise in the title would just make it less titley :).

awesome_dude · 2025-01-29T22:34:03 1738190043

Yeah - a quantifiable amount in the headline would change the likelihood of the article being taken seriously - it goes from "No downtime? I call BS" to "Less than 5 seconds, that seems reasonable, and worth investigating"

ElijahLynn · 2025-01-29T21:54:32 1738187672

Less than 5 seconds seems pretty reasonable to me to call it zero down time.

tossandthrow · 2025-01-29T22:06:31 1738188391

A 5-second pause on queries would make our app server drop connections and throw errors under cyclical high load, which would result in an incident.

paulddraper · 2025-01-29T23:04:49 1738191889

Strong energy of "someone brushed up against me and that's assault" going on here

SahAssar · 2025-01-29T22:20:05 1738189205

AWS Aurora Postgres is a forked Postgres with a different storage engine. Sure, you are technically correct, but there are many things called "Postgres compatible" that are very much less Postgres than AWS Aurora Postgres (CockroachDB, for example).

nijave · 2025-01-29T22:29:00 1738189740

IIRC AWS explicitly calls out that they still use the upstream Postgres query engine and some other parts. It very much _is_ Postgres, but not 100% pure upstream Postgres.

SahAssar · 2025-01-29T22:43:10 1738190590

Yep, for example that is how they advertise protocol, feature and language compatibility.

unethical_ban · 2025-01-29T21:50:13 1738187413

They reduced their potential downtime from 60s to what I assume is only a few seconds (they don't state in the article).

If there is no noticeable user impact or unavailability of services (what counts is unique to each service), then there is no downtime.

stopachka · 2025-01-29T22:03:08 1738188188

> they don't state in the article

Thank you for pointing this out. I updated the essay to mention how long the pause took explicitly:

> After about a 3.5 second pause [^13], the failover function completed smoothly! We had a new Postgres instance serving requests, and best of all, nobody noticed.
>
> [^13]: About 2.5 seconds to let active queries complete, and about 1 second for the replica to catch up

metadat · 2025-01-30T02:53:58 1738205638

What is the [^13] notation? Is it different than a *?

Izkata · 2025-01-30T04:06:57 1738210017

They copy/pasted from the article, that's how they're formatting footnote links. Article has 15 footnotes and that's number 13.

paulddraper · 2025-01-29T21:38:19 1738186699

> They're not even running Postgres, but AWS Aurora

But everything described is also PostgreSQL compatible.

> downtime

Context switching pauses execution too FYI.