This article seems to indicate that manually triggered failovers will always fail if your application tries to maintain its normal write traffic during that process.
Not that I'm discounting the author's experience, but something doesn't quite add up:
- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
- If they do know, how is this not an urgent P0 issue for AWS? It would mean one of the most basic usability features is completely broken.
- Is there something more nuanced to the failure case, such as whether it depends on in-progress transactions? I can imagine the failover waiting for in-flight transactions to close, hitting a timeout, and then proceeding with the rest of the failover anyway. That could explain why the issue doesn't seem more widespread.
> How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
If it's anything like how Azure handles this kind of issue, it's likely "lots of people have experienced it, a restart fixes it so no one cares that much, few have any idea how to figure out a root cause on their own, and the process to find a root cause with the vendor is so painful that no one ever sees it through"
An experience not exclusive to cloud vendors :) Even better when the vendor throws their hands up because the issue is not reliably reproducible.
That's when I scripted a test that ran hundreds of times a day in a lower environment, attempting a repro. As they say, at scale even insignificant issues become significant. I don't remember exactly, but I think there was a 5-10% chance of the issue triggering on any given run.
At least confirming the fix, which we did eventually receive, was mostly a breeze. We had to provide an inordinate amount of captures, logs, and data to get there, though. It was a grueling few weeks, especially all the office-politics-laden calls.
I've had customers live with load-related bugs for years simply because they'd reboot whenever the problem happened. When dealing with the F100, it seems there is a rather limited number of people in these organizations who can troubleshoot complex issues; either that, or they're locked away out of sight.
It is a tough bargain to be fair, and it is seen in other places too. From developers copying out their stuff from their local git repo, recloning from remote, then pasting their stuff back, all the way to phone repair just meaning "here's a new device, we synced all your data across for you", it's fairly hard to argue with the economic factors and the effectiveness of this approach at play.
With all the enterprise solutions being distributed, loosely coupled, self-healing, redundant, and fault-tolerant, issues like this essentially just slot in perfectly. Compound that with man-hours (especially expert ones) being a lot harder to justify for any one particular bump in tail latency, and the economics of chasing these down just aren't there.
What gets us specifically to look into things is either the issue being operationally gnarly (e.g. frequent, impactful, or both), or management being swayed enough by principled thinking (or at least pretending to be). I'd imagine it's the same elsewhere. The latter mostly happens when fixing a given thing becomes an office-politics concern, or a corporate-reputation one. You might wonder whether those individual issues ever snowballed into a big one, but it turns out human nature takes care of that just "sufficiently enough" before it manifests "too severely". [0]
Otherwise, you're looking at fixing / RCA'ing / working around someone else's product defect on their behalf, and giving your engineers a "fun challenge". Fun doesn't pay the bills, and we rarely saw much in return from the vendor in exchange for our research. I'd love to entertain the idea that maybe behind closed doors the negotiations went a little better because of it, but for various reasons, in hindsight I really doubt it.
[0] as delightfully subjective as those get of course
Theoretically you're supposed to assign lower priority to issues with known workarounds, but then there should also be reporting for product management (which weights issues by age of first occurrence and total count of similar occurrences).
Amazon is mature enough for its processes to reflect this, so my guess for why something like this could slip through is either too many new feature requests or too many more-critical issues to resolve.
I'm surprised this hasn't come up more often too. When we worked with AWS on this, they confirmed there was nothing unique about our traffic pattern that would trigger this issue. We also didn't run into this race condition in any of our other regions running similar workloads. What's particularly concerning is that this seems to be a fundamental flaw in Aurora's failover mechanism that could theoretically affect anyone doing manual failover.
If I'm reading this correctly, it sounds like the connection was already using autocommit by default? In that situation, if you initiate a transaction, and then it gets rolled back, you're back in autocommit unless/until you initiate another transaction.
If so, that part is all totally normal and expected. It's just that, due to a bug in the Python client library (16 years ago), the rollback happened silently because the error was not surfaced properly.
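For illustration, here is a minimal sketch of what a correctly behaving client should do in that scenario. It uses PyMySQL and a made-up `accounts` table (not the library from 16 years ago); the point is just that the deadlock surfaces as an exception, the server has already rolled the transaction back, and subsequent statements run under the connection's autocommit setting again.

    import pymysql

    # Hypothetical connection; autocommit=True mirrors the "autocommit by
    # default" setup described above.
    conn = pymysql.connect(host="db.example.internal", user="app",
                           password="secret", database="app", autocommit=True)

    try:
        conn.begin()  # explicit transaction starts here
        with conn.cursor() as cur:
            cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
            cur.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")
        conn.commit()
    except pymysql.MySQLError as exc:
        # A correct driver raises here. MySQL error 1213 (ER_LOCK_DEADLOCK)
        # means the server already rolled the transaction back.
        if exc.args[0] == 1213:
            print("deadlock: server rolled the transaction back")
        else:
            raise

    # At this point the connection is back in autocommit mode, so any statement
    # issued now commits immediately -- exactly the surprise the buggy library
    # caused by swallowing the error.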
Is there any scenario in a sane world where a transaction ceases to be in scope just because it went into an error state? I'd have expected the client to send an explicit ROLLBACK when it realizes a transaction is in an error state, not for the server to end it and just notify the client. This is how psql behaves for the end user:
postgres=# begin;
BEGIN
postgres=*# bork;
ERROR: syntax error at or near "bork"
LINE 1: bork;
^
postgres=!# select 1;
ERROR: current transaction is aborted, commands ignored until end of transaction block
postgres=!# rollback;
ROLLBACK
postgres=# select 1;
?column?
----------
1
(1 row)
postgres=#
Every DBMS handles errors slightly differently. In a sane world you shouldn't ever ignore errors from the database. It's unfortunate to hear that a Python MySQL client library had a bug that failed to expose errors properly in one specific situation 16 years ago, but that's not terribly relevant to today.
Postgres behavior with errors isn't even necessarily desirable -- in terms of ergonomics, why should every typo in an interactive session require me to start my transaction over from scratch?
No, the situation described upthread is about a deadlock error, not a typo. In MySQL, syntax errors throw a statement-level error but it does not affect the transaction state in any way. If you were in a transaction, you're still in a transaction after a typo.
With deadlocks in MySQL, the error you receive is "Deadlock found when trying to get lock; try restarting transaction" i.e. it specifically tells you the transaction needs to start over in that situation.
In programmatic contexts, transactions are typically represented by an object/struct type, and a correctly-implemented database driver for MySQL handles this properly and invalidates the transaction object as appropriate if the error dictates it. So this isn't really even a common foot-gun in practical terms.
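As a rough illustration of that pattern (not any specific driver's built-in behavior; `run_in_transaction` and its arguments are invented for this sketch, again using PyMySQL):

    import time
    import pymysql

    DEADLOCK = 1213  # MySQL ER_LOCK_DEADLOCK

    def run_in_transaction(conn, work, retries=3):
        # Re-run the *whole* transaction on deadlock, never just the failed
        # statement, because the server has already rolled everything back.
        for attempt in range(retries):
            try:
                conn.begin()
                with conn.cursor() as cur:
                    work(cur)
                conn.commit()
                return
            except pymysql.MySQLError as exc:
                if exc.args[0] != DEADLOCK or attempt == retries - 1:
                    raise
                time.sleep(0.05 * (attempt + 1))  # brief backoff before retrying

This mirrors what the server-side error message asks for: restart the transaction, not the statement.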
What do you mean? Autocommit mode is the default mode in Postgres and MS SQL Server as well. This is by no means a MySQL-specific behavior!
When you're in autocommit mode, BEGIN starts an explicit transaction, but after that transaction (either COMMIT or ROLLBACK), you return to autocommit mode.
The situation being described upthread is a case where a transaction was started, and then rolled back by the server due to deadlock error. So it's totally normal that you're back in autocommit mode after the rollback. Most DBMS handle this identically.
The bug described was entirely in the client library failing to surface the deadlock error. There's simply no autocommit-related bug as it was described.
Lack of autocommit would be bad for performance at scale, since it would add latency to every single query. And the MVCC implications are non-trivial, especially for interactive queries (human taking their time typing) while using REPEATABLE READ isolation or stronger... every interactive query would effectively disrupt purge/vacuum until the user commits. And as the sibling comment noted, that would be quite harmful if the user completely forgets to commit, which is common.
In any case, that's a subjective opinion on database design, not a bug. Anyway it's fairly tangential to the client library bug described up-thread.
Autocommit mode is pretty handy for ad-hoc queries at least. You wouldn't want to have to remember to close the transaction, since keeping a transaction open is often really bad for the DB.
My experience with AWS is that they are extremely, extremely parsimonious about any information they give out. It is near-impossible to get them to give you any details about what is happening beyond the level of their API. So my gut hunch is that they think that there's something very rare about this happening, but they refuse to give the article writer the information that might or might not help them avoid the bug.
If you pay for the highest level of support you will get extremely good support. But it comes with signing an NDA, so you're not going to read about anything coming out of it on a blog.
I've had AWS engineers confirm very detailed and specific technical implementation details many, many times. But these were at companies that happily spent over $1M/year with AWS.
Nah, if your monthly spend is really significant then you will get good support, and the issues you care about will get prioritized. Going from a startup with $50K/month spend to a large company spending untold millions per month, the experience is night and day. We have dev managers and engineers from key AWS teams present in meetings when needed, we get the issues we raise prioritized and added to dev roadmaps, etc.
Yeah I agree, this seems like a pretty critical feature of the Aurora product itself. We saw similar behavior recently with a connection pooler in between, which suggests something is wrong with how they propagate DNS changes during the failover. wtf aws
Whenever we have to do any type of AWS Aurora or RDS cluster modification in prod we always have the entire emergency response crew standing by right outside the door.
Their docs are not good and things frequently don't behave how you expect them to.
The article is low quality. It does not mention which Aurora PostgreSQL version was involved, and it provides no real detail about how the staging environment differed from production, only saying that staging “didn’t reproduce the exact conditions,” which is not actionable.
In “Amazon Aurora PostgreSQL updates”, under Aurora PostgreSQL 17.5.3, September 16, 2025 – Critical stability enhancements, there is a potential match:
“...Fixed a race condition where an old writer instance may not step down after a new writer instance is promoted and continues to write…”
If that is the underlying issue, it would be serious, but without more specifics we can’t draw conclusions.
For context: I do not work for AWS, but I do run several production systems on Aurora PostgreSQL. I will try to reproduce this using the latest versions over the next few hours. If I do not post an update within 24 hours, assume my tests did not surface anything.
That would not rule out a real issue in certain edge cases, configurations, or version combinations but it would at least suggest it is not broadly reproducible.
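In case it's useful, this is roughly the shape of the test I have in mind (boto3 against a throwaway cluster; the cluster identifier is a placeholder, and the continuous write load would run separately against the cluster endpoint):

    import time
    import boto3

    rds = boto3.client("rds")
    CLUSTER = "my-test-aurora-cluster"  # placeholder

    def current_writer():
        # Return the instance currently marked as the cluster writer.
        cluster = rds.describe_db_clusters(DBClusterIdentifier=CLUSTER)["DBClusters"][0]
        for member in cluster["DBClusterMembers"]:
            if member["IsClusterWriter"]:
                return member["DBInstanceIdentifier"]

    # With an independent write load running against the writer endpoint,
    # trigger a manual failover and wait for the writer to change.
    before = current_writer()
    rds.failover_db_cluster(DBClusterIdentifier=CLUSTER)

    while current_writer() == before:
        time.sleep(5)

    print("writer moved from", before, "to", current_writer())

Repeating that in a loop while watching for write errors on the load side should show whether the failover ever reverts the way the article describes.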
We're running Aurora PostgreSQL 15.12, which includes the fix mentioned in the release notes. Looking at this comment and the AWS documentation, I think there's an important distinction to make about what was actually fixed in Aurora PostgreSQL 15.12.4. Based on our experience and analysis, we believe AWS's fix primarily focused on data protection rather than eliminating the race condition itself.
Here's what we think is happening:
Before the fix (pre-15.12.4):
1. Failover starts
2. Both instances accept and process writes simultaneously
3. Failover eventually completes after the writer steps down
4. Result: Potential data consistency issues ???
After the fix (15.12.4+):
1. Failover starts
2. If the old writer doesn't demote before the new writer is promoted, the storage layer now detects this and rejects write requests
3. Both instances restart/crash
4. Failover fails or requires manual intervention
The underlying race condition between writer demotion and reader promotion still exists - AWS just added a safety mechanism at the storage layer to prevent the dangerous scenario of two writers operating simultaneously. They essentially converted a data inconsistency risk into an availability issue.
This would explain why we're still seeing failover failures on 15.12 - the race condition wasn't eliminated, just made safer.
The comment in the release notes about "fixed a race condition where an old writer instance may not step down" is somewhat misleading. It's more accurate to say they "mitigated the consequences of the race condition" by having the storage layer reject writes when it detects the problematic state, which is probably why AWS Support did not point us to this release when we raised the issue.
FWIW we haven't seen issues doing manual failovers for maintenance using the same or similar procedure described in the article. I imagine there is something more nuanced here, and it's hard to draw too many conclusions without a lot more detail being provided by AWS.
It sounds like part of the problem was how the application reacted to the reverted fail over. They had to restart their service to get writes to be accepted, implying some sort of broken caching behavior where it kept trying to send queries to the wrong primary.
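If that's what happened, the usual client-side mitigation is to treat a "read-only" error on a write as a signal to drop the connection and reconnect (forcing DNS re-resolution of the cluster endpoint) rather than retrying on the same socket. A rough sketch of that idea with psycopg2, using a hypothetical DSN and leaving retries and idempotency concerns aside:

    import psycopg2
    from psycopg2 import errors

    DSN = "host=my-cluster.cluster-xxxx.rds.amazonaws.com dbname=app user=app"  # placeholder

    conn = psycopg2.connect(DSN)
    conn.autocommit = True

    def execute_write(sql, params=None):
        # If the node we're connected to turns out to be a reader, reconnect so
        # the cluster endpoint is resolved again, then retry once.
        global conn
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
        except (errors.ReadOnlySqlTransaction, psycopg2.OperationalError):
            conn.close()
            conn = psycopg2.connect(DSN)  # fresh connection -> fresh DNS lookup
            with conn.cursor() as cur:
                cur.execute(sql, params)

A connection pooler in the middle needs the equivalent behavior; otherwise a service restart really is the only way out.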
It's at least possible that this sort of aborted failover happens a fair amount, but if there's no downtime then users just try again and it succeeds, so they never bother complaining to AWS. Unless AWS is specifically monitoring for it, they might be blind to it happening.
It could be that most people pause writes, because it's going to create errors if you try to execute a write against an instance that refuses to accept writes, and for some people those errors might not be recoverable. So they just have some option in their application that puts it into maintenance mode, where writes are hard-rejected at the application layer.
P0 if it happens to everyone, right? Like the USE1 outage recently. If it's 0.001% of customers (enough to get an HN story), it may not be that high. Maybe this customer is on a migration or upgrade path under the hood. Or just on a bad unit in the rack.
Although the article has an SEO-optimized vibe, I think it's reasonable to take it as true until refuted. My rule of thumb is that any rarely executed, very tricky operation (e.g. database writer fail over) is likely to not work because there are too many variables in play and way too few opportunities to find and fix bugs. So the overall story sounds very plausible to me. It has a feel of: it doesn't work under continuous heavy write load, in combination with some set of hardware performance parameters that plays badly with some arbitrary time out.
Note that the system didn't actually fail. It just didn't process the fail over operation. It reverted to the original configuration and afaics preserved data.
The skill of a plumber isn't in knowing how to solder. I've run copper pipe and soldered fittings on my own plenty of times, it's not hard. The skill is in either knowing the building codes inside and out when dealing with new work, or for remodel work it's knowing all the tricks of how to alter existing plumbing quickly, cleanly and efficiently. Both those skills are only developed with experience.
I agree. The comments in this thread don't seem to understand how skilled-trade work operates at all.
One truism is that any halfway decent tradesman always has about 10x the amount of work waiting for them as they can accomplish. People who can do the job well are always in demand.
Uber for plumbers would be a disaster because the only plumbers who have the free time for waiting around for on-call work are the ones who are terrible at their job.
I figure evolution will solve that. The kind of people who don’t have kids while living in prosperity will die out. The ones who reproduce will stick around.
We’ll build mirror life to assist us so we keep not needing children before evolution has a chance to fix anything. I postulate it is coming this century.
This is only a problem when you look at the micro level of cultures or individual states. Sure, some culture may die out, but that's been happening forever.
There's 8 billion humans on this planet, and we're still fucking like we always have been. The human race will be safe from prosperity.
Humans will number 10 to 11 billion before the curve starts pointing downward. Even China, the supposed basket case of population collapse, will "collapse" back to its population level of a few decades ago. And the current population was once supposed to be catastrophically overpopulated.
I don't agree with them but there are significant numbers of people who think 10 or 11 billion is way beyond sustainability.
Population decline is predicted or currently happening in some poor countries too. It's not a prosperity driven effect. Children don't die young anymore even in poor countries. There's just generally less pressure to spawn your own gang of supporters. Elon excepted I guess.
"Prosperity" implies that the problem is folks smart enough to not have children beyond the means to raise them into a similar or better lifestyle.
I prefer "Precarity" induced fertility collapse. Down here in the mud I guess I have a different view with my 1 child and wife with a heart condition who would likely die from a 2nd pregnancy.
Doubt it. Especially when you realize the cost to the company for an employee is much more than just take-home salary. Healthcare, employer payroll taxes, and such all add up. You could also argue whether deferred comp like stock options and RSUs should be counted as part of the cost. The employee's "comp package" often comes in at 2x or more of their base salary.
I would say that with a computer you're using a tool to take care of mundane details and speed up the mechanics of tasks in your life. Such as writing a document, or playing a game. I can't think of a way I would be seriously disadvantaged by not having the ability to hand-write an essay or have games I can readily play without a computer. Computers are more like tools in the way a hammer is a tool. I don't mind being totally dependent on a computer for those tasks in the same way I don't mind that I need a hammer anytime I want to drive a nail.
But for many people, LLMs replace critical thinking. They offer the allure of outsourcing planning, research, and generating ideas. These skills seem more fundamental to me, and I would say there's definitely a loss somehow of one's humanity if you let those things atrophy to the point you become utterly dependent on LLMs.
> But for many people, LLMs replace critical thinking...[and] outsourc[e] planning, research, and generating ideas
Sure, but I guess you could say that any tech advancement outsources these things, right? I don't have to think about what gear to pick when I drive a car to maximize its performance, I don't have to think about "i before e" types of rules when spell check will catch it, I don't have to think about how to maintain a draft horse or think as much about types of dirt or terrain difficulties when I have a tractor.
Or, to add another analogy, for something like a digital photo compared to film photography that you'd develop yourself or portrait painting before that: so much planning and critical thought has been lost.
And then there's another angle: does a project lead not outsource much of this to other people? This invites a "something human is being lost" critique in a social/developmental context, but people don't really lament that the CEO has somehow lost his humanity because he's outsourcing so much of the process to others.
I'm not trying to be clever or do gotchas or anything. I'm genuinely wrestling with this stuff. Because you might be right: dependence on LLMs might be bad. (Though I'd suggest that this critique is blunted if we're able to eventually move to hosting and running this stuff locally.) But I'm already dependent on a ton of tech in ways I probably can't even fully grasp.
What would be the point? Ships are complex machines and the ocean is an incredibly harsh and unforgiving environment. You would need crew to simply maintain the ship while underway and fix things that break (which pretty much always happens).
Next, these ships can be downright dangerous for other, smaller ships. It has happened before that large cargo ships have killed people on sailboats when they didn't keep an adequate lookout. I wouldn't trust an autonomous system to have enough accuracy to avoid running over other boats.
No, it's already a solved problem. For instance newspapers moderate and approve all content that they print. While some bad actors may be able to sneak scams in through classifieds, the local community has a direct way to contact the moderators and provide feedback.
The answer is that it just takes a lot of people. What if no content could appear on Facebook until it passed a human moderation process?
As the above poster said, this is not profitable which is why they don't do it. Instead they complain about how hard it is to do programmatically and keep promising they will get it working soon.
A well-functioning society would censure them. We should say that they're not allowed to operate in this broken way until they solve the problem. Fix first.
Big tech knows this which is why they are suddenly so politically active. They reap billions in profit by dumping the negative externalities onto society. They're extracting that value at a cost to all of us. The only hope they have to keep operating this way is to forestall regulation.
> The answer is that it just takes a lot of people.
The more of those people you hire, the higher the chance that a bad actor will slip through and push malicious things through for a fee. If the scammer has a good enough system, they'll do this one time with one person and then move on to the next one, so now you need to verify that all your verifiers are in fact perfect in their adherence to the rules. Now you need a verification system for your verification system, which will eventually need a verification system^3 for the verification system^2, ad infinitum.
This is simply not true in every single domain. The fact people think tech is different doesn't mean it necessarily is. It might just mean they want to believe it's different.
At the end of the day, I can't make an ad and put it on a billboard pretending to be JP Morgan and Chase. I just can't.
> This is simply not true in every single domain. The fact people think tech is different doesn't mean it necessarily is. It might just mean they want to believe it's different.
Worldwide and over history, this behaviour has been observed in elections (gerrymandering), police forces (investigating complaints against themselves), regulatory bodies (Boeing staff helping the FAA decide how airworthy Boeing planes are), academia (who decides what gets into prestigious journals), newspapers (who owns them, who funds them with advertisements, who regulates them), and broadcasts (ditto).
> At the end of the day, I can't make an ad and put it on a billboard pretending to be JP Morgan and Chase. I just can't.
JP Morgan and Chase would sue you after the fact if they didn't like it.
Unless the owners of the billboard already had a direct relationship with JP Morgan and Chase, they wouldn't have much of a way to tell in advance. If they do already have a relationship with JP Morgan and Chase, they may deny the use of the billboard for legal adverts that are critical of JP Morgan and Chase and their business interests.
The same applies to web ads, the primary difference being each ad is bid on in the first blink of an eye of the page opening in your browser, and this makes it hard to gather evidence.
> The more of those people you hire, the higher the chance that a bad actor will slip through and push malicious things through for a fee.
Again, the newspaper model already solves this. Moderation should be highly localized, done by people from the communities for which the content is being moderated. That maximizes the chance that the moderators' values will align with the community's. It's harder for bad actors to hide in small groups, especially when you can be named and shamed by people you see every day. Managers, coworkers, and the community itself are the "verifiers."
Again, this model has worked since the beginning of time and it's 1000x better than what FB has now.
> What if no content could appear on Facebook until it passed a human moderation process?
While I'd be just fine with Meta, X etc. (even YouTube, LinkedIn, and GitHub!) shutting down because the cost of following the law turned out to be too expensive, what you suggest here also has both false positives and false negatives.
False negatives: Polari (and other cants) existed to sneak past humans.
False positives: humans frequently misunderstand innocent uses of jargon as signs of malfeasance, e.g. vague memories of a screenshot from ages ago where someone accidentally opened the web browser's dev console while on Facebook, saw messages about "child elements" being "killed", freaked out.
> The answer is that it just takes a lot of people. What if no content could appear on Facebook until it passed a human moderation process?
A lot of people = a lot of cost. That would probably settle out lower than the old classified ads, but paying even a dollar per Facebook post would be a radically different use than the present situation.
And of course you'd end up with a ban of some sort on all smaller forums and BBS that couldn't maintain compliance requirements.
Study after study finds that a health care model where you can visit a doctor frequently leads to much better overall health for people.
The system we have here forces people to wait until minor issues turn into life or death situations that require much more intensive and expensive care.
Let's suppose that a doctor's visit costs $200 for someone without insurance. And let's say that the two options are 1) insurance premiums are $1000/month, but it's only a $20 copay to visit the doctor, and 2) major-only insurance at $100/month. I can visit the doctor pretty regularly on that difference of $900/month.
All this hypothetical tells us is that you're young, healthy, single, and have a good income. Which is exactly the issue with our health system -- it only works when you don't get sick.
The median household income in the US is $83k. That's for a whole family. I would challenge you to come up with a monthly budget for four people that can support anything like $1000 a month for insurance (which for a family is actually going to be more like $2000) OR handle multiple $200 doctor visits per month. And mind you, there is no such thing as a doctor visit that costs only $200 unless you're talking about a routine physical. Because the first thing that happens when you're sick is the doctor starts ordering tests and referring you to specialists. And let's hope nobody needs a prescription!
And then you find what life looks like for 150 million Americans -- you're constantly putting off healthcare until it becomes an emergency. You're gambling with your own life and the life of your children trying to not go bankrupt.
https://www.beliefnet.com/news/2003/09/the-gospel-of-supply-...