PostgreSQL High Availability Solutions – Part 1: Jepsen Test and Patroni (binwang.me)
158 points by wb14123 10 months ago | 42 comments


Is anyone here using YugabyteDB for high-availability Postgres?

It seems like a compelling option:

* Much closer to Postgres compatibility than CockroachDB.

* A more permissive license.

* Built-in connection manager [1], which should simplify deployment.

* Supports both high availability and geo-distribution, which is useful if scaling globally becomes necessary later.

That said, I don't see it mentioned around here often. I wonder if anyone here has tried it and can comment on it.

--

1: https://docs.yugabyte.com/preview/explore/going-beyond-sql/c...


I was under the impression that Yugabyte requires signing a CLA to contribute, which leads me to avoid it for fear of them relicensing the thing when the VCs start squeezing. Also: it's a one-of-a-kind, single-vendor-driven project. Seems like too much of a risk longer term, but that's just my take.

EDIT: in response to your question, I did run a PoC of it, but I ran into issues where I wasn't able to create very large indexes without the statement timing out on me. Basic hand-benchmarking of complex joins on very large tables was very slow, if the queries finished at all. I suppose systems like this and CockroachDB really want short, simple statements and high client concurrency rather than large, complex queries.


> DDL timeouts

That’s normal for building indices on large tables, regardless of the RDBMS. Increase the timeout, and build them with the CONCURRENTLY option.
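
For instance, a minimal sketch in Postgres syntax (the timeout value and the index/table/column names are made up; pick what fits your data):

    -- Raise the timeout for this session only, then build the index
    -- without holding a long write lock on the table.
    SET statement_timeout = '4h';
    CREATE INDEX CONCURRENTLY idx_orders_customer_id
        ON orders (customer_id);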

> Query speed

Without knowing your schema and query, I can’t say with any certainty, but it shouldn’t be dramatically slower than single-node Postgres, assuming your table statistics are accurate (have you run ANALYZE <table>?), the necessary indices are in place, and no parameters are set horrendously wrong.
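
A quick way to check both, again with hypothetical table and column names:

    -- Refresh planner statistics, then look at the actual plan.
    ANALYZE orders;
    EXPLAIN (ANALYZE, BUFFERS)
        SELECT o.id, c.name
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        WHERE o.created_at > now() - interval '7 days';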


Not sure about the CLA process, but the database is already under a restrictive, proprietary license:

    ## Free Trial
    
    Use to evaluate whether the software suits a particular
    application for less than 32 consecutive calendar days, on
    behalf of you or your company, is use for a permitted purpose.

https://github.com/yugabyte/yugabyte-db/blob/master/licenses...

It's not really clear what this means (what is a permitted purpose?), but it seems the intent is that after 32 days, you are expected to pay up. Or at least prepare for a future when the infrastructure to charge customers is in place (if it isn't there yet).


Thanks. I think that only covers the commercial bits they run themselves though:

  "The entire database with all its features (including the enterprise ones) is licensed under the Apache License 2.0

  The binaries that contain -managed in the artifact and help run a managed service are licensed under the Polyform Free Trial License 1.0.0."

EDIT: formatting


It also mentions:

> By default, the build options generate only the Apache License 2.0 binaries.

So, it seems like the proprietary builds are for the managed services that they host themselves, which makes sense.


Index creation should not be controlled by the statement timeout, but by backfill_index_client_rpc_timeout_ms, which defaults to 24 hours. It may have been lower in old versions.


It seems CockroachDB got all the love here indeed. We use Yugabyte and we are happy with it; for our use cases it is a lot faster and easier to work with than Cockroach.


I'm curious about this as well. I often see people talk about CockroachDB in production, but I don't think I've ever heard of anyone running Yugabyte. But it is definitely under active development.

I found two threads discussing it from the past year:

https://news.ycombinator.com/item?id=39430411

https://news.ycombinator.com/item?id=38914764

Yugabyte (as with CockroachDB and TiDB) is based on mapping relations to an LSM-tree-based KV store, where ranges of keys get mapped to different nodes managed through a Raft group. That kind of structure has very different performance characteristics compared to Postgres' page-based MVCC. In particular, LSM trees are not a free lunch.

Query execution is also very different when a table's data is spread over multiple nodes. For example, joins are done on the query executor side by executing remote scans against each participating storage node and then merging the results. That's always going to be slower than a system that already has all the data locally.

YB also lacks some index optimizations. There is some work to make bitmap index scans work in YB, which will give a huge performance boost to many queries, but it's incomplete. YB does have some optimizations (like loose index scans) that Postgres does not have. So it's probably fair to say that YB is a lot slower than PG for some things and a little faster at others.

I think it's fundamentally not a bad architecture, just different from Postgres. So even though they took the higher layers from Postgres, there's a whole bunch of rearchitecting needed in order to make the higher layers work with the lower ones. You do get some Postgres stuff for free, but I wonder if the amount of work here is worth it in the end. So much in Postgres makes the assumption of a local page heap.

What we see in cases where someone takes Postgres and replaces the guts (Greenplum, Cloudberry, and of course YDB) is that it becomes a huge effort to keep up with new Postgres versions. YDB is on Postgres 12, which came out in 2019, and is slowly upgrading to 15, which came out in 2022. By the time they've upgraded to 15, it will probably be 2-3 versions behind, and the work continues.

Worth noting: Yugabyte was tested by Kyle Kingsbury back in 2019, which uncovered some deficiencies. Not sure what the state is today. The YB team also runs their own Jepsen tests now as part of CI, which is a good sign.


Regarding the last point:

> Yugabyte was tested by Kyle Kingsbury back in 2019, which uncovered some deficiencies. Not sure what the state is today. The YB team also runs their own Jepsen tests now as part of CI, which is a good sign.

Please see this blog post https://www.yugabyte.com/blog/chaos-testing-yugabytedb/ for the latest updates, as well as information on additional in-house frameworks for resiliency and consistency testing.


> What we see in cases where someone takes Postgres and replaces the guts (Greenplum, Cloudberry, and of course YDB) is that it becomes a huge effort to keep up with new Postgres versions.

The first upgrade is the hardest, but after that we will have the framework in place to perform subsequent upgrades much sooner. When the pg11-to-pg15 upgrade becomes available, it will be in-place and online without affecting DML; no other PG fork offers this capability today.


I was referring to the effort by the developers to keep the forked codebase itself up to date with mainline. Isn't that the main hurdle?

My understanding is that you are patching a lot of core Postgres code rather than providing the functionality through any kind of plugin interface, so every time there is a major Postgres release, "rebasing" on top of it is a large effort.

That, to my knowledge, is why Greenplum fell behind so much. It took them four years to get from 9.6 to 12, and I believe that's where they are today.


Citus is an extension. That's the best you can get by staying outside the core DB. If you want a truly distributed architecture, then you need to change the query optimizer, DDL, transaction, and even query statistics components, at which point it ends up being a fork.

Yes, the merges are hard. But pg12 changed lots of fundamental things, making it very challenging. Pg15 to pg17 should be much simpler.


Yugabyte is Postgres-compatible, not actually Postgres.

It's also only compatible insofar as you can use a subset of Postgres features; they only support the most basic things like SELECT, views, etc.

Triggers, NOTIFY, etc. were out of scope the last time I checked (which has admittedly been a while).


You're right of course, it's not entirely compatible, but this might be interesting:

> We use vanilla Postgres as-is for the query layer and replace Postgres storage with YugabyteDB’s own distributed storage engine.

https://www.yugabyte.com/blog/yugabytedb-enhanced-postgres-c...


After a while I was only skimming that blog post, because it reads like a pitch from an MBA targeted at venture capitalists.

I feel like that's not actually a distinction that matters to application developers, because it's a technical detail that only concerns the developers of the database. Ultimately, all the compatibility has to be implemented in the storage engine. The fact that they're using Postgres's porcelain is surely a time saver for them, but of no consequence to the consumers/users of the database.


YugabyteDB supports much more than the basics. I've been a dev advocate for Yugabyte for 3+ years, and triggers have always worked. LISTEN/NOTIFY is not there yet (it is an anti-pattern for horizontal scalability, but we will add it since some frameworks use it). It's not yet 100% compatible, but no distributed SQL database has more PG compatibility. Many (Spanner, CRDB, DSQL) are only wire protocol + dialect. YugabyteDB runs Postgres code and provides the same behavior (locks, isolation levels, datatype arithmetic...).


One thing possibly holding some folks back is the version of Postgres it's held back to. Right now YDB has PostgreSQL 12 compatibility. Support for PG15 is under active development, so hopefully it's a 2025 feature. I really wanted to be able to actually use YugabyteDB for once, but our developers are reportedly using PG15+ features.

https://github.com/yugabyte/yugabyte-db/issues/9797


YDB is another database; they unfortunately didn't protect that trademark.

But they do call it yugabyteDB, YugabyteDB, YugaByte DB, yugabyte-db, and Yugabyte.


Right now it's based on PostgreSQL 11.2 with some patches pulled from newer releases. The upgrade will be to PG 15, and it includes work to make further PG upgrades easier (think upgrading a cluster online across PostgreSQL major versions).


The best way I have found is to set up keepalived -> pgbouncer -> Postgres. Use repmgr to manage replication and barman for backups. Set up a VIP with keepalived, with a small script that checks whether the server is the primary. You lose about 7-9 pings during a failover; have keepalived check about every 2 seconds and flip after 3 consecutive failures.
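
The "is this the primary?" check can be a single query; a minimal sketch of what the keepalived track script might run via psql (treat a false result, or no response, as a failure):

    -- True only on the primary; a streaming standby is in recovery.
    SELECT NOT pg_is_in_recovery() AS is_primary;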


Haven't used it yet, but seeing as both Yugabyte and Cockroach are being mentioned...

pgEdge: https://github.com/pgedge/pgedge Demo: https://youtu.be/Gpty7yNlwH4?t=1873

Not affiliated with them.

I recall that aspirationally pgEdge aims to be compatible with the latest pg version or one behind.


Great that nobody can track, or easily contribute to, the underlying postgres bug, because postgres has no issue tracker.

Keeps the number of reported bugs nice and low. The discussion of critical bugs that lose your data is left to HN and Twitter threads instead.


Wow, they do public issue tracking in an unusual way, via a mailing list: https://www.postgresql.org/list/pgsql-bugs/



The mailing-list issue tracker works quite well.


> If the PostgreSQL backend is cancelled while waiting to acknowledge replication (as a result of packet cancellation due to client timeout or backend failure) transaction changes become visible for other backends. Such changes are not yet replicated and may be lost in case of standby promotion.

This sounds like the two generals problem, which has no solution. But I may be misunderstanding.


"[The two generals problem] said that you can't achieve consensus (both safety and liveness at the same time), they did not say you have to sacrifice safety under message-losses or asynchrony conditions. So Paxos preserves safety under all conditions and achieves liveness when conditions improve outside the impossibility realm (less message losses, some timing assumptions start to hold)." http://muratbuffalo.blogspot.com/2010/10/paxos-taught.html


Wouldn't the simple fix be to delay backend (= one connection) closing until all pending replication it initiated is finished?

That still leaves actual crashes, which would require using shared memory to store the list of pending replications until transaction recovery has finished.


> require mandatory telemetry collection for free version

Couldn't one simply define kubernetes network policies to limit egress from CockroachDB pods?


I'm currently looking for similar info but for MySQL/MariaDB for an IoT side project ... any suggestions?


MySQL has had first-class replication and failover built into it for years. You deploy a server, enable binlogs, clone the server (Percona's xtrabackup can live-clone a server locally or remotely) and start the new instance. Then you point the replica at the master using 'CHANGE MASTER TO <host> ...' and it starts pulling binlogs and applying them. On the master, you can repeat this last step, making the replica its master. This means that you have multi-master replication. And it just works. There are other tools you can use to detect failure of the current master and switch to another; that part is up to you.
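
A hedged sketch of that pointing step, with placeholder host, credentials, and binlog coordinates (the real coordinates come from the xtrabackup copy):

    -- On the freshly cloned replica:
    CHANGE MASTER TO
        MASTER_HOST = '10.0.0.1',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'secret',
        MASTER_LOG_FILE = 'mysql-bin.000042',
        MASTER_LOG_POS = 12345;
    START SLAVE;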

There are also solutions like MySQL Cluster and Galera which provide a more cluster-like setup with synchronous replication. If you've got a suitable use case (low writes, high reads, and no gigantic transactions) this can work extremely well. You bootstrap a cluster on node 1, and new members automatically take a copy of the cluster data when they join. You can have 3- or 5-node clusters, and reads are distributed across all nodes since replication is synchronous. Beware though, operating one of these things requires care. I've seen people suffer read downtime or full cluster outages by doing operations without understanding how they work under the hood. And if your cluster is hard-down, you need to pick the node with the "most recent transaction" to re-bootstrap the cluster, or you can lose transactions.
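
To find that node, compare committed sequence numbers across the nodes; a rough sketch:

    -- On each node that is still running:
    SHOW STATUS LIKE 'wsrep_last_committed';
    -- If mysqld is down everywhere, compare the seqno recorded in each
    -- node's grastate.dat (or run mysqld --wsrep-recover) and bootstrap
    -- from the node with the highest value.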


Depends what sort of solution you want.

There's regular single-primary/n-replica replication built in. There's no built-in automatic failover.

There's also Group Replication built in. This can be either single-primary/n-replica with automatic election of a new primary during a failure, or it can be multi-primary.
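
For example, once a group is configured, membership and the currently elected primary can be checked with a plain query against MySQL 8.x performance_schema (shown here as an illustration, not a full setup):

    -- One row per group member, with its state and role.
    SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE
    FROM performance_schema.replication_group_members;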

Then there's Galera, which is similar to the multi-primary mode of Group replication.


I've been running some small Galera clusters for close to a decade with very good results. It's multi-master, and mostly "just works", though once every year or three I have to go in and re-bootstrap the cluster from the node with the highest sequence number if all of the nodes have gone down for one reason or another. I have some Ansible playbooks and scripts that help with it.

My use case is a fairly small database of email address mappings. I have a mail server that sends out emails on behalf of our users, using their outside e-mail address (gmail, yahoo, whathaveyou). In order to allow replies, but prevent being an open relay, I have a milter that creates a lookalike address, with a token in it ([email protected] -> [email protected]).

I store those mappings in the mysql database, less than 10K of those. So a trivial database, very high read to write ratio.

More recently, maybe 3-4 years ago, I added another couple of tables that store all the logs. I have a job that reads the postfix logfiles and writes out all the log information so we can show users the status of each of their outgoing e-mails. That upped the amount of traffic to the database by quite a lot, but it's still pretty simple: basically no transactions, just simple single insert statements, and IIRC one overview table that gets some updates, plus a cleanup job to delete them after 30-90 days.

Galera has been a real workhorse.

For a while I was going to go with CockroachDB, and I set up a PoC. It was pretty robust in setup and clustering; I never ran into having to re-bootstrap the cluster. But at the time Postfix couldn't write directly to CockroachDB, because Cockroach could only do UTF-8 and the Postfix pgsql code would only do latin-1 or something. This has since changed and I've thought about switching back to Cockroach, but I hear there are some licensing issues I may have to be aware of.


We use ProxySQL (https://proxysql.com), which works very well. We have not seen any downtime for years. We wrote our own master promotion code a very long time ago; it has proven to be very robust.


It would be awesome if you could do the same test with Stolon!


The attached code actually mentions Stolon in some comments, so maybe that would be a future post from the author?


What an absolutely delightful little project and write up.


Anyone familiar with autobase.tech?


Is there any alternative to Jepsen that does not involve writing spaghetti Clojure code?


Writing Clojure without spaghetti isn't too hard, and definitely more practical than waiting for a Jepsen alternative to come along.

The Jepsen author gave a great talk on all the performance-engineering work that has gone into it; Jepsen is near enough an entire DBMS in its own right: https://www.youtube.com/watch?v=EUdhyAdYfpA


I have found that some projects are using Porcupine as an alternative.

I just really couldn't justify Clojure in my project, and personally, if I want Lisp, I know where to find it.



