
The title is pretty misleading. They're not even running Postgres, but AWS Aurora, which is Postgres compatible, but is not Postgres.

Also, pausing queries does count as downtime. The system was unavailable for that period of time.




> The title is pretty misleading. They're not even running Postgres, but AWS Aurora, which is Postgres compatible, but is not Postgres.

For what it's worth, every command we ran works on normal Postgres. Hence we didn't think it mattered to mention Aurora specifically in the title.

> Also, pausing queries does count as downtime.

If a query takes a bit longer to respond, I don't think that counts as downtime. From the perspective of the user, they couldn't distinguish this migration event from some blip of slightly slower queries.


> If a query takes a bit longer to respond, I don't think that counts as downtime. From the perspective of the user, they couldn't distinguish this migration event from some blip of slightly slower queries.

It comes down to defining Service Level Objectives (SLOs) that are meaningful to your users. For one system I worked on, latency was important, and so one SLO was "99.999% of <a certain class of> requests with a deadline >=1s should succeed with latency <1s", so if this affected more than 0.001% of requests in <time interval defined in our SLO>, we'd have called it an outage. But I've also worked on systems with looser SLOs where this would have been fine.
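To make that concrete, here's a rough sketch of how such a check might look. The function name, the request counts, and the default target are all made up for illustration; a real SLO would also pin down the request class and the measurement window.

```python
def slo_violated(total_requests: int, bad_requests: int, target: float = 0.99999) -> bool:
    """Return True if the share of good requests fell below the SLO target.

    bad_requests counts requests that failed or missed the latency bound
    (for example, requests held longer than 1s during a migration pause).
    The 0.99999 default is only an example target.
    """
    if total_requests == 0:
        # No traffic in the window, so nothing can have violated the SLO.
        return False
    good_fraction = (total_requests - bad_requests) / total_requests
    return good_fraction < target


# Hypothetical window: 1,000,000 requests, 400 of them stalled past 1s.
print(slo_violated(1_000_000, 400))                 # strict five-nines target
print(slo_violated(1_000_000, 400, target=0.999))   # looser three-nines target
```

The same pause counts as an outage under the strict target and as a non-event under the looser one, which is the whole disagreement in this thread.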


Not only that but I think you also need to take upstream systems into account. With a reasonably robust frontend that handles transient issues and retries reasonably, I think it's ok to say "no downtime"
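Something like this is what I mean by a client that retries reasonably. It's only a sketch: `with_retries`, the exception types, and the delay values are my own placeholders, not anything from the article.

```python
import time


def with_retries(op, attempts=5, base_delay=0.5,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call op() and retry transient failures with exponential backoff.

    With the example defaults the waits are 0.5s, 1s, 2s and 4s, which is
    enough to ride out a pause of a few seconds before giving up.
    """
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                # Out of attempts: surface the error to the caller.
                raise
            sleep(base_delay * 2 ** attempt)


# Usage: wrap whatever call talks to the database.
# result = with_retries(lambda: run_query("SELECT 1"))  # run_query is hypothetical
```

Whether that is acceptable still depends on the caller: a request with a hard 1s deadline gets nothing out of a retry that lands after 3 seconds.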


Completely depends on what the "user" is. Are they a human, or a machine that explicitly requires timings within a particular threshold?


It depends if it feels like an outage


> If a query takes a bit longer to respond, I don't think that counts as downtime

"We're sorry that your query took 7 hours to be responded to, but it wasn't an outage - honest"


We would count 7 hours as downtime too. Our pause was less than 5 seconds.


Nice job, then! Technical downtime that’s virtually undetectable to users is a big win. In fact, “less than 5 seconds of downtime” in the title would actually make me want to read the article more as I tend to be suspicious of “zero downtime” claims for database upgrades, whereas <5s is clearly almost as good as zero and actually quantified :)


On the other hand, "less than 5 seconds of downtime" might give the impression that new queries sent within that time period would be rejected, while zero implies this doesn't happen, i.e. that it's indistinguishable from normal operation for the client.

And being even more precise in the title would just make it less titley :).


Yeah - a quantifiable amount in the headline would change the likelihood of the article being taken seriously - it goes from "No downtime? I call BS" to "Less than 5 seconds, that seems reasonable, and worth investigating"


Less than 5 seconds seems pretty reasonable to me to call it zero down time.


A 5-second pause on queries would make our app server drop connections and throw errors under cyclical high load, which would result in an incident.


Strong energy of "someone brushed up against me and that's assault" going on here


AWS Aurora Postgres is a forked Postgres with a different storage engine. Sure, you are technically correct, but there are many things called "Postgres compatible" that are very much less Postgres than AWS Aurora Postgres (for example, CockroachDB).


IIRC AWS explicitly calls out that they still use the upstream Postgres query engine and some other parts. It very much _is_ Postgres, but not 100% pure upstream Postgres.


Yep, for example that is how they advertise protocol, feature and language compatibility.


They reduced their potential downtime from 60s to what I assume is only a few seconds (they don't state in the article).

If there is no noticeable user impact or unavailability of services (this is unique to each service in existence), then there is no downtime.


> they don't state in the article

Thank you for pointing this out. I updated the essay to mention how long the pause took explicitly:

After about a 3.5 second pause [^13], the failover function completed smoothly! We had a new Postgres instance serving requests, and best of all, nobody noticed.

[^13]: About 2.5 seconds to let active queries complete, and about 1 second for the replica to catch up


What is the [^13] notation? Is it different than a *?


They copy/pasted from the article, that's how they're formatting footnote links. Article has 15 footnotes and that's number 13.


> They're not even running Postgres, but AWS Aurora

But everything described is also PostgreSQL compatible.

> downtime

Context switching pauses execution too FYI.



