The disaggregated write-ahead log (2023)

pilgrim0 · on Feb 8, 2024

Note that I’m commenting in the spirit of “being ok to dream”, as the article promoted.

There’s only so much creativity available for software engineers to work around the fundamental constraints imposed by the lower levels. Software Engineering being a highly demanded skill, I strongly believe, is a reflection of how inefficient are the interlaces of standard architectures and processes. It reminds me of the Curse of Lisp, which, roughly, states that the reason for its lack of popularity and absence of large communities of developers stems from it being too powerful of a language, when compared, say, to Java. OSes are monstrosities struggling to keep pace with demand. The very concept and handling of the File is limiting and obsolete. It’s not an ideal building block for databases. Another is the fact that we have generalized compute but not storage. Memory and Storage are synonyms, so I imagine the ideal scenario is for both to be a single entity. I mean, the memory hierarchy should be flattened in the future. If this happens concurrently with an increase in capacity sufficient to emulate an infinite tape, then a number of today’s cutting edge software architectures will become relics — memories of a time when computers were not as cooperative and malleable as we needed them to be. The information revolution is just beginning, after all, and I absolutely love the fact that the path forward will be paved, primarily, by the collective effort of a multitude of creative minds, as always. Great article.

epaulson · on Feb 7, 2024

This sentence is doing a lot of work: "Hypothetical S2 does a bit more to simplify the layers above – it makes leadership above the log convenient with leases and fenced writes."

It'd be awesome to have a bit more transactional help from S3. You could go a long way with 'only update this object if the ETags on these other objects are still the same'. I know AWS doesn't want to turn S3 into a full database but some updates you just can't do without having a whole 2nd service running alongside to keep track of the states of your updates.

shikhar · on Feb 7, 2024

Agreed, both Google Cloud Storage and Azure Blob Storage support preconditions. Azure even has leases. S3 is for better or worse the common denominator for systems layering on top of object storage.

refset · on Feb 8, 2024

Consensus protocols, durability and transactional semantics are (should be) closely coupled. I recall TigerBeetle discussing somewhere how they could achieve better throughput and durability guarantees by combining replication/recovery with the consensus protocol, instead of layering it above. I.e. disaggregating the log can be expensive. There's a reference in [0] that might elaborate.

> TigerBeetle is “fault-aware” and recovers from local storage failures in the context of the global consensus protocol, providing more safety than replicated state machines such as ZooKeeper and LogCabin

[0] https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/DE...

shikhar · on Feb 8, 2024

I believe TigerBeetle are alluding to their integration of protocol-aware recovery [1], which is a worthy consideration for the log implementation. Yet another engineering concern which can be offloaded.

If the disaggregated log integrates some mechanism to support leadership "above" it [2], it can be functionally identical to a converged log. Efficiency-wise yes there will be some extra network messages – but networks are very high throughput [3] and fast (sub-millisecond within a cloud region) these days!

[1] https://www.usenix.org/conference/fast18/presentation/alagap... [2] https://maheshba.bitbucket.io/blog/2023/05/06/Leadership.htm..., also on HN yesterday [3] https://blog.enfabrica.net/the-next-step-in-high-performance...

refset · on Feb 8, 2024

Thanks, yes protocol-aware recovery was the context. Pretty sure I first heard it described in Joran's QCon London 2023 talk here: https://youtu.be/_jfOk4L7CiY?t=1460

> If you want your distributed database to maximise availability, how your local storage engine recovers from storage faults in the write-ahead log needs to be properly integrated with the global consensus protocol.

bvrmn · on Feb 7, 2024

Generic robust replicated log would be nice to have. For two products I've implemented leaderless oplog to back application level replication to sync not only database changes but also configuration files and other needed data. Works like a charm.

EdwardDiego · on Feb 8, 2024

Have you evaluated Bookkeeper? I haven't used it much, but keen on your thoughts on it especially on the robustness.

esafak · on Feb 8, 2024

It underlies Apache Pulsar. https://pulsar.apache.org/docs/3.1.x/administration-zk-bk/

EdwardDiego · on Feb 14, 2024

It does :) That was why I asking. The times I evaluated Pulsar (and thus BookKeeper) I found myself unconvinced of BK's reliability, just because of wrong docs, comments in the user facing code that were wrong etc.

But I never used it for real, with real data volume and throughput, so maybe the actual implementation was solid. The concepts in BK definitely made a lot of sense.

You need bear in mind that Pulsar is the FOSS version of Twitter's homebrewed PubSub, with the same people leading it now (Streamnative) who led it in Twitter.

And Twitter replaced it with Kafka. Which added to my caution.

So yeah, thought I'd ask if they'd evaluated BK, because it's relevant to the reliability of Pulsar.

If BK is ZK levels of solid, then that's awesome.

HammadB · on Feb 7, 2024

Curious - what was the specific leaderless replication strategy?

bvrmn · on Feb 7, 2024

Each operation has uniq id. Each node has own log with autoincremented op counter. According to topology rules. Each node tries to fetch new parts of remote logs with new ids from other nodes and apply to local state. Domain allows to express most operations in order independent manner. And for rest it's possible to use last op wins.

ditsuke · on Feb 7, 2024

Remember reading this one just last month!