It's not mentioned in the headline and not made super clear in the article: This is specific to multi-AZ clusters, which is a relatively new feature of RDS, and differ from multi-AZ instance that most will be familiar with. (Clear as mud.)
Multi-AZ instances is a long-standing feature of RDS where the primary DB is synchronously replicated to a secondary DB in another AZ. On failure of the primary, RDS fails over to the secondary.
Multi-AZ clusters has two secondaries, and transactions are synchronously replicated to at least one of them. This is more robust than multi-AZ instances if a secondary fails or is degraded. It also allows read-only access to the secondaries.
Multi-AZ clusters no doubt have more "magic" under the hood, as its not a vanilla Postgres feature as far as I'm aware. I imagine this is why it's failing the Jepsen test.
Interesting why this magic would be needed. Vanilla Postgres does support quorum commit which can do this. You can also set up the equivalent multi-AZ cluster with Patroni, and (modulo bugs) it does the necessary coordination to make sure to promote primaries in a way that does not lose transactions or makes visible a transaction that is not durable.
There still is a Postgres deficiency that makes something similar to this pattern possible. Non-replicated transactions where the client goes away mid-commit become visible immediately. So in the example, if T1 happens on a partitioned leader, disconnects during commit, T2 also happens on a partitioned node, and T3 and T4 happen later on a new leader, you would also see the same result. However, this does not jive with the statement that fault injection was not done in this test.
Edit: did not notice the post that this pattern can be explained by inconsistent commit order on replica and primary. Kind of embarrassing given I've done a talk proposing how to fix that.
So if snapshot violation is happening inside Multi-AZ instances, it can happen with a single region - multiple read replica kind of setup as well ? But it might be easily observable in Multi-AZ setups because the lag is high ?
A synchronous replica via WAL shipping is a well-worn feature of Postgres. I’d expect RDS to be using that feature behind the scenes and would be extremely surprised if that has consistency bugs.
Two replicas in a “semi synchronous” configuration, as AWS calls it, is to my knowledge not available in base Postgres. AWS must be using some bespoke replication strategy, which would have different bugs than synchronous replication and is less battle-tested.
But as nobody except AWS knows the implementation details of RDS, this is all idle speculation that doesn’t mean much.
I don't think it's possible with ANY set up. All you get is that some replicas are more outdated than others. But they won't return 2 conflicting states when ReplicaA says tx1 wrote (but not tx2), while ReplicaB says tx2 wrote (but not tx1). Which is what Long Fork and Parallel Snapshot are about.
So Amazon Multi-cluster seems to replicate changes out of order?
Kinda. I think it's "just" PostgreSQL behaviour that's to blame here: On replicas, transaction commit visibility order is determined by the order of WAL records; on the primary it's based on when the backend that wrote the transaction notices that its transaction is sufficiently persisted.
I think it's still an important clarification, because for years you've had a choice in RDS (classic RDS, not Aurora) between "single-AZ" and "multi-AZ" instances, with the general rule of thumb that production workloads should always be multi-AZ.
however, "multi-AZ" has been made ambiguous, because there are now multi-AZ instances and multi-AZ clusters.
...and your multi-AZ "instance", despite being not a multi-AZ "cluster" from AWS's perspective, is still two nodes that are "clustered" together and treated as one logical database from the client connection perspective.
see [0] and scroll down to the "availability and durability" screenshot for an example.
Multi-AZ instances is a long-standing feature of RDS where the primary DB is synchronously replicated to a secondary DB in another AZ. On failure of the primary, RDS fails over to the secondary.
Multi-AZ clusters has two secondaries, and transactions are synchronously replicated to at least one of them. This is more robust than multi-AZ instances if a secondary fails or is degraded. It also allows read-only access to the secondaries.
Multi-AZ clusters no doubt have more "magic" under the hood, as its not a vanilla Postgres feature as far as I'm aware. I imagine this is why it's failing the Jepsen test.