In the systems you propose, is it possible to store multiple logical transactions per IO operation? My batching approach lets me reason in terms of "business operations per disk IO" (a positive integer greater than 1 in many cases).
To get a better idea: assume a 4K block size and a 128-byte size for some business type. You can hypothetically store 32 of these per physical write if you have perfect batching going on. Looking at a Samsung 960, which has ~2.1 GB/s sequential write speed (approximately our use case), you would be able to persist ~16.4 million transactions per second in the most ideal scenario. This is extremely notable, because the maximum stated random write throughput for this device at QD32 with 4K blocks is only ~440k ops/s. Mix in even the most pedestrian of compression algorithms and that 16M turns into an even more ridiculous figure, assuming your transactions look somewhat similar to each other, the system is fully loaded, and the planets are appropriately aligned. These circumstances might sound extreme, but they are fairly common in areas like fintech, where you might need to match millions of orders in ~1 second and it's non-stop like this for hours.
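For concreteness, here is that back-of-the-envelope math as a runnable sketch (the constants are the ones above; everything else is illustrative):

```cpp
// Back-of-envelope check of the figures above.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t block_size    = 4096;            // 4K physical write
    const uint64_t record_size   = 128;             // one business record
    const uint64_t seq_write_bps = 2'100'000'000;   // ~2.1 GB/s sequential

    const uint64_t records_per_block = block_size / record_size;  // 32
    const uint64_t blocks_per_sec    = seq_write_bps / block_size; // ~512K
    const uint64_t tx_per_sec        = blocks_per_sec * records_per_block;

    std::printf("%llu records/block * %llu blocks/s = ~%.1fM tx/s\n",
                (unsigned long long)records_per_block,
                (unsigned long long)blocks_per_sec,
                tx_per_sec / 1e6);  // prints ~16.4M tx/s
}
```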
I am not aware of how LSM+WAL would meet the objectives I have laid out above, most notably because of the implication that the disk is being touched multiple times per business transaction. Please correct me if I am mistaken in this regard. My solution ensures that the disk is touched <= 1 time per logical write (where the size of the write is bounded by the block size of the device). A sketch of what I mean is below.
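A minimal sketch of the kind of block-aligned batching I mean (names and error handling are illustrative, not my actual implementation):

```cpp
// Records accumulate in a block-sized buffer; a full buffer costs exactly
// one write() + one fsync(), so each logical write shares <= 1 physical IO.
#include <unistd.h>
#include <cstddef>
#include <cstring>

constexpr size_t kBlockSize  = 4096;
constexpr size_t kRecordSize = 128;

class BlockBatcher {
public:
    explicit BlockBatcher(int fd) : fd_(fd) {}

    // Append one record; flush when the block fills (32 records here).
    void append(const char record[kRecordSize]) {
        std::memcpy(buf_ + used_, record, kRecordSize);
        used_ += kRecordSize;
        if (used_ == kBlockSize) flush();
    }

    // One sequential, block-sized write + one fsync for the whole batch.
    // A partially filled block goes out zero-padded; errors are elided.
    void flush() {
        if (used_ == 0) return;
        ::write(fd_, buf_, kBlockSize);
        ::fsync(fd_);
        used_ = 0;
        std::memset(buf_, 0, kBlockSize);
    }

private:
    int fd_;
    size_t used_ = 0;
    char buf_[kBlockSize] = {};
};
```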
Yes, when you write to a file it doesn't actually hit the disk; it stays in a kernel buffer until the buffers grow too large or fsync (or a variant) is explicitly called. For example, in RocksDB you'd issue a few writes and then call SyncWAL() to actually perform the IO and durably commit to disk (or issue a final write with sync=true).
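Roughly like this with the C++ API (the path and keys are placeholders):

```cpp
#include <rocksdb/db.h>
#include <cassert>

int main() {
    rocksdb::DB* db;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wal_demo", &db);
    assert(s.ok());

    rocksdb::WriteOptions wo;   // sync defaults to false: buffered WAL writes
    db->Put(wo, "k1", "v1");
    db->Put(wo, "k2", "v2");

    s = db->SyncWAL();          // one sync covers both preceding writes
    assert(s.ok());

    // Alternatively: make only the last write synchronous, which also
    // makes the earlier buffered WAL entries durable.
    rocksdb::WriteOptions sync_wo;
    sync_wo.sync = true;
    db->Put(sync_wo, "k3", "v3");

    delete db;
}
```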
This is not something LSMs specifically implement; it's just how kernels do file IO.
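You can see the same behavior with nothing but POSIX calls (the path is a placeholder):

```cpp
// write() only copies into the kernel page cache; data reaches the device
// when the kernel flushes it, or when fsync()/fdatasync() is called.
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

int main() {
    int fd = ::open("/tmp/demo.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;

    const char rec[] = "logical write\n";
    ::write(fd, rec, std::strlen(rec));  // buffered in the page cache
    ::write(fd, rec, std::strlen(rec));  // still buffered

    ::fdatasync(fd);  // now both records are durable, from a single flush

    ::close(fd);
}
```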
RocksDB also does additional IO coalescing for concurrent writes, though that's more about reducing syscall cost (one write() per write group instead of one per write) than IO cost.
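Something in the spirit of this sketch (a simplified leader/follower scheme, not RocksDB's actual code; followers here return without waiting for durability):

```cpp
// Concurrent writers park their payloads in a shared batch; one "leader"
// thread drains the whole group with a single write(), amortizing the
// syscall cost across every writer in the group.
#include <unistd.h>
#include <mutex>
#include <string>
#include <vector>

class WriteGroup {
public:
    explicit WriteGroup(int fd) : fd_(fd) {}

    void submit(std::string payload) {
        std::unique_lock<std::mutex> lk(mu_);
        pending_.push_back(std::move(payload));
        if (leader_active_) return;  // the current leader will drain this too
        leader_active_ = true;
        while (!pending_.empty()) {
            std::string batch;
            for (auto& p : pending_) batch += p;
            pending_.clear();
            lk.unlock();
            ::write(fd_, batch.data(), batch.size());  // one syscall per group
            lk.lock();  // re-check for payloads queued during the write
        }
        leader_active_ = false;
    }

private:
    int fd_;
    std::mutex mu_;
    std::vector<std::string> pending_;
    bool leader_active_ = false;
};
```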
> For example, in RocksDB you'd issue a few writes and then call SyncWAL() to actually perform the IO and durably commit to disk (or issue a final write with sync=true).
Ok, that makes sense. I think we are mostly on the same page here. My equivalent of SyncWAL is invoked naturally each time my buffer reader hits the barrier and dumps the current batch of items to be processed.