> Cool, our hypothesis has been validated. This should be easy to fix: if we sync the log when we open it, we should guarantee that no unsynced reads are ever served:
I'm not sure this is accurate in general. If fsync fails (either before or after the crash), the dirty pages will be marked as clean, but may never make it to disk. I think the only solution to this problem that is reasonably portable is to use direct IO.
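To make "use direct I/O" concrete, here's a minimal sketch in C of what opening a log with O_DIRECT on Linux might look like (the block size, flag combination, and helper names are my own illustrative assumptions, not from the article):

```c
#define _GNU_SOURCE /* required for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>

/* O_DIRECT bypasses the page cache, so a failed write is reported to the
 * caller instead of dirty pages being silently marked clean. Buffers,
 * offsets, and lengths must be aligned to the logical block size
 * (assumed to be 4096 bytes here). */
enum { LOG_BLOCK_SIZE = 4096 };

int open_log_direct(const char *path) {
    /* O_DSYNC makes each write durable (or failed) before it returns,
     * so there is no separate fsync whose error could be lost. */
    return open(path, O_RDWR | O_CREAT | O_DIRECT | O_DSYNC, 0644);
}

void *alloc_log_buffer(size_t size) {
    void *buf = NULL;
    /* Direct I/O requires memory alignment as well as offset/length alignment. */
    if (posix_memalign(&buf, LOG_BLOCK_SIZE, size) != 0) return NULL;
    return buf;
}
```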
EDIT:
The author pointed out this line to me on Twitter:
> but for today, let’s keep things simple and assume syncing is a reliable process.
I'll leave my comment for others to read, but yes if fsync is deemed reliable that solution seems OK.
> I have to caveat this post with the fact that we are ignoring large swathes of failure modes.
I work on a database [1] that has a strict storage fault model, and so I thought I'd add a few of these failure modes to the discussion. For example, I/O operations to and from disk may be:

- corrupted,
- misdirected to the wrong sector when writing or reading,
- rendered impossible because of a latent sector error,
- no-ops that perform no actual I/O while reporting success, or
- significantly delayed, sometimes by several seconds (gray failure).
Beyond the obvious:
A corrupt log checksum does not always imply a system crash, fsync() on macOS [2] does not in fact flush to the disk "platter", and data may be corrupted at any time during its storage lifecycle (before being written, while being written, after being written, and while being read).
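On the macOS point specifically, a portable flush helper generally has to special-case Darwin. A minimal sketch, assuming the documented fsync/F_FULLFSYNC behavior rather than anything from the article:

```c
#include <fcntl.h>
#include <unistd.h>

/* On macOS, fsync() only pushes data to the drive, not through the drive's
 * write cache; F_FULLFSYNC asks the drive to flush to stable storage. */
int flush_to_stable_storage(int fd) {
#ifdef __APPLE__
    return fcntl(fd, F_FULLFSYNC);
#else
    return fsync(fd);
#endif
}
```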
Disk sector writes are also not atomic, for example, an Advanced Format 4096 byte sector write to a disk with an emulated logical sector size of 4096 bytes but a physical sector size of 512 bytes is not atomic, and would be split into 8 physical sector writes, which may or may not be atomic.
The Linux kernel page cache is not always reliable and may misrepresent the state of data on disk after an EIO or latent sector error, and file system metadata (such as the size of the log file as stored in the inode) as presented by the kernel is not always reliable and may change at any time (the inode can simply be corrupted on disk). Critical metadata such as the log file size should therefore always be stored redundantly, out of band, rather than trusting the inode or the filesystem. Essentially, you need to treat the disk as a block device.
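To illustrate the point about not trusting the inode for the log size, here's a hypothetical sketch of storing that metadata redundantly in fixed, checksummed superblock copies (the struct layout, field names, and recovery rule are my own assumptions, not any particular database's format):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk header, written redundantly to two fixed locations
 * (e.g. block 0 and block 1), so the log size never depends on the inode. */
struct log_superblock {
    uint64_t magic;    /* identifies the file as our log */
    uint64_t version;  /* monotonically increasing copy version */
    uint64_t log_size; /* bytes of valid log data, stored out of band */
    uint64_t checksum; /* covers all fields above */
};

/* On open: read both copies, discard any whose checksum fails, and take the
 * surviving copy with the highest version. If both fail, recovery must not
 * fall back to the filesystem-reported size. */
const struct log_superblock *pick_superblock(const struct log_superblock *a,
                                             const struct log_superblock *b,
                                             int a_valid, int b_valid) {
    if (a_valid && b_valid) return a->version >= b->version ? a : b;
    if (a_valid) return a;
    if (b_valid) return b;
    return NULL; /* neither copy is trustworthy */
}
```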
> However, the solution to most of these problems is “use checksums,” which is not that exciting, which is why I’m omitting it from this discussion.
A better solution, I believe, is simply to start with a clearly defined storage fault model and to lean on the storage fault research that's out there. In fact, the research suggests that systems which reduce the problem space largely to checksums do not tend to get storage faults right, and that checksums (while critical) are the least of the solution.
Beyond a clear model for safety, a large part of the solution is to use O_DIRECT [3] and to design protocols (and the whole system) so that they are storage-fault-aware [4], even as far out as the distributed consensus layer. Local storage faults propagate across distributed systems and break formal proofs of correctness for protocols such as Paxos and Raft [4], because these proofs assume that stable storage is non-Byzantine, so replicated redundancy does not in fact imply fault tolerance [5]. A single disk sector failure on one machine can result in global data loss or cluster unavailability.
One also has to be careful to disentangle crashes from corruption when recovering from the log, in order to know whether to truncate or to repair; the checksum alone can't make this distinction [4]. Care also needs to be taken when handling read/write syscall return values that indicate partial sector input/output, especially where this interacts with differing physical/logical sector sizes [6]. Another thing that's critical for successful log recovery is being able to read "around" faulty sectors, to preserve durability and surface as much data as possible to higher layers [7].
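On the partial input/output point, a read loop that refuses to treat a short read as a complete one might look roughly like this (a generic pread-style sketch; with O_DIRECT the retried offsets and lengths would additionally have to stay sector-aligned):

```c
#include <errno.h>
#include <unistd.h>

/* Read exactly `len` bytes at `offset`, retrying on short reads and EINTR.
 * Returns 0 on success, -1 on error or unexpected end-of-file, so a partial
 * sector read is never silently treated as a full one. */
int read_exactly(int fd, void *buf, size_t len, off_t offset) {
    char *p = buf;
    while (len > 0) {
        ssize_t n = pread(fd, p, len, offset);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        if (n == 0) return -1; /* unexpected EOF: fewer bytes than expected */
        p += n;
        len -= (size_t)n;
        offset += n;
    }
    return 0;
}
```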
Checksums also need to be daisy-chained or hash-chained, like ZFS does, all the way back to the copy-on-write superblock. Otherwise, a read may simply return the wrong block without any checksum failure. There also needs to be proactive scrubbing of disks to detect and repair storage faults before they take out all copies. Checksums are of limited use if there's no scrubbing, and thus no fault detection, in the first place.
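A toy sketch of that parent-to-child chaining idea: the checksum of each block is stored in its parent's pointer rather than next to the block itself, so a misdirected read of an internally consistent but wrong block is still caught (the names, layout, and hash choice here are hypothetical, not ZFS's actual format):

```c
#include <stddef.h>
#include <stdint.h>

/* A parent block references its child by address *and* by expected checksum.
 * If the device returns the wrong block, the checksum stored in the parent
 * still won't match, even if the wrong block's own checksum is valid. */
struct block_ptr {
    uint64_t address;           /* where the child block should live */
    uint64_t expected_checksum; /* checksum of the child's contents */
};

/* FNV-1a as a stand-in 64-bit checksum for the sketch. */
uint64_t checksum64(const void *data, size_t len) {
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

/* Verify a child block against the checksum its parent expects. */
int verify_child(const struct block_ptr *ptr, const void *child, size_t len) {
    return checksum64(child, len) == ptr->expected_checksum;
}
```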
> Cool, our hypothesis has been validated. This should be easy to fix: if we sync the log when we open it, we should guarantee that no unsynced reads are ever served
Here's an example of this in Zig [8], assuming O_DIRECT, that Alex Miller of FoundationDB pointed out to me.
> If we want to improve our throughput, we need to find a way to turn multiple logical writes (Commands) into a single physical “batch.”
This is also our own experience, and how TigerBeetle is able to process more than a million financial transactions a second. We reduce the problem to a matter of how fast we can write sequentially to disk and how fast we can send data across the network, while maintaining stability and working around slowly failing hardware.
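The batching idea itself is simple to sketch: accumulate logical commands into one buffer, issue a single write and a single sync, and only then acknowledge every command in the batch (buffer size and names below are illustrative, not TigerBeetle's code):

```c
#include <string.h>
#include <unistd.h>

/* Turn many logical commands into one physical write: the cost of a single
 * write+fsync is amortized over every command in the batch. */
enum { BATCH_CAPACITY = 1 << 20 };

struct batch {
    char buf[BATCH_CAPACITY];
    size_t len;
    size_t count; /* number of logical commands in this batch */
};

int batch_append(struct batch *b, const void *cmd, size_t cmd_len) {
    if (b->len + cmd_len > BATCH_CAPACITY) return -1; /* full: flush first */
    memcpy(b->buf + b->len, cmd, cmd_len);
    b->len += cmd_len;
    b->count++;
    return 0;
}

/* One write and one sync for the whole batch; only after this returns
 * successfully may any of the batched commands be acknowledged. */
int batch_flush(int log_fd, struct batch *b) {
    if (write(log_fd, b->buf, b->len) != (ssize_t)b->len) return -1;
    if (fsync(log_fd) != 0) return -1;
    b->len = 0;
    b->count = 0;
    return 0;
}
```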
> The easier way to think about this is that the MEMTABLE (or some other indexed structure) is the afterthought. It’s just the cache in front of where the real action is: the log.
This is where TigerBeetle is at now. We literally have a deterministic memtable (deterministic so that we can do autonomous testing, à la FoundationDB) in front of the consensus log. With every consensus operation, we ratchet the LSM-tree a little, deterministically, so that compaction is incremental and streaming, with no foreground latency stalls and no need for write throttling.
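Stripped of TigerBeetle specifics, the general pacing idea is just to amortize a fixed slice of compaction work on every committed operation instead of doing it all at once. A rough sketch of that idea only (the names, units, and scheduling rule are purely illustrative, not TigerBeetle's actual design):

```c
#include <stdint.h>

/* Hypothetical: total merge work owed for the current compaction, measured
 * in "units" (e.g. blocks to merge), spread evenly over the operations that
 * will arrive before the compaction must complete. */
struct compaction {
    uint64_t work_remaining; /* units of merge work still owed */
    uint64_t ops_remaining;  /* operations left before the deadline */
};

/* Called once per committed operation: do just enough work that the
 * compaction finishes on schedule, with no foreground stall. */
void compaction_tick(struct compaction *c) {
    if (c->ops_remaining == 0) return;
    uint64_t step = (c->work_remaining + c->ops_remaining - 1) / c->ops_remaining;
    uint64_t done = step < c->work_remaining ? step : c->work_remaining;
    /* do_merge_step(done); -- hypothetical helper performing `done` units */
    c->work_remaining -= done;
    c->ops_remaining--;
}
```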
We're coding this live on Twitch every day (in Zig!) if you'd like to drop by and chat!
I agree with nearly everything here, but I'm hesitant to recommend O_DIRECT. In my experience, O_DIRECT provides a false sense of security, and if you're already solving for unreliable physical storage, then it doesn't matter. You should only use O_DIRECT when there is a clear and measurable performance advantage.
The paper shows that for databases requiring durability and redo logging, O_DIRECT is a good idea for safety.
I do enjoy working with O_DIRECT. I find it works best when designing systems from the ground up for O_DIRECT. It leads to a very simple system in the end that works equally well on block devices, which is a good place to be. And the performance gains by not thrashing the CPU cache through memcopies to the page cache are nice.
It's interesting how the proper handling of local storage faults is now also recognized to be all the more critical for globally replicated systems: local faults do propagate across distributed systems.
For example, when not using O_DIRECT, an ack back to the consensus protocol for local data that was recovered from the log at startup but never in fact made durable (only marked clean in the kernel page cache after an fsync failure) could cause a quorum swing after the next reboot, and with it global data loss.
Thanks so much for the thoughtful response! TigerBeetle seems very cool, I hope to dig in and learn more about it at some point. As an aside, your Twitter thread on macOS fsync behaviour was very helpful for me!
It's a huge pleasure, I'm glad you enjoyed it. Diving into those fsync details at the time was a lot of fun!
If you have time at some point to dig in more, we also did a Zig SHOWTIME intro on TigerBeetle that you might like (https://www.youtube.com/watch?v=BH2jvJ74npM) where we dig into these storage faults in more detail, plus a little Lord of the Rings ;)