You don't want the OS to take care of reading from disk and of page caching/eviction. You want the DB itself to have explicit control over those, because the DB has information about access patterns and table formats that the OS does not. It is better equipped than the OS to anticipate which portions of tables and indices need to be cached in memory. It is better equipped to calculate when/where/what/how much to prefetch from disk. It is better equipped to determine when to buffer writes and when to flush to disk.
Sure, it's more work than using mmap. But it's also more correct, it forces you to handle edge cases explicitly, and it's much more amenable to platform-specific improvements à la kqueue/io_uring.
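To make the io_uring point concrete, here's a minimal sketch of an explicit asynchronous read through liburing (a hedged illustration, not anything from a real engine; "data.db", the queue depth, and the 4 KiB read are placeholders). The point is that the engine decides exactly which page to fetch and when, instead of hoping a page fault does the right thing:

    /* Explicit read via liburing: the DB, not the page-fault handler,
     * chooses what to fetch and when. Build with -luring. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <liburing.h>

    int main(void) {
        struct io_uring ring;
        int ret = io_uring_queue_init(8, &ring, 0);   /* queue depth 8: arbitrary */
        if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); return 1; }

        int fd = open("data.db", O_RDONLY);           /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);  /* page 0: our choice */
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %d bytes\n", cqe->res);          /* res < 0 means -errno */
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }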
OTOH, if you care about that last 5 percent or so of performance, there is the added complexity that what the OS has optimized for can differ between OSes (e.g., macOS, Linux, FreeBSD), can change between versions of Linux, and, in the case of buffered writeback, can even differ between filesystems on the same version of Linux. This is probably, historically, one of the most important reasons why enterprise databases like Oracle DB, DB2, etc., have used direct I/O, and not buffered I/O or mmap.
Speaking as an OS developer, we're not going to try to optimize buffered I/O for a particular database. We'll be using benchmarks like compilebench and postmark to optimize our I/O, and if your write patterns, or readahead patterns, or caching requirements, don't match those workloads, well... sucks to be you.
I'll also point out that the big companies that actually pay the salaries of us file system developers (e.g., Oracle, Google, etc.) for the most part use direct I/O for our performance-critical workloads. If database companies that want to use mmap want to hire file system developers and contribute benchmarks and performance patches for ext4, xfs, etc., then, speaking as the ext4 maintainer, I'll welcome that, and we have a weekly video conference where I'd love to have your engineers join to discuss your contributions. :-)
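For reference, direct I/O at the syscall level looks roughly like the sketch below (an illustration, not vendor code; the 4096-byte alignment is an assumption here, real code queries the device/filesystem for the required alignment):

    /* O_DIRECT bypasses the page cache entirely, so the buffer, the
     * offset, and the length must all be block-aligned. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.db", O_RDONLY | O_DIRECT);   /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;  /* aligned buffer */

        ssize_t n = pread(fd, buf, 4096, 0);  /* aligned length and offset */
        if (n < 0) perror("pread");
        else printf("read %zd bytes, page cache untouched\n", n);

        free(buf);
        close(fd);
        return 0;
    }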
The MongoDB developers once thought as you do. They were wrong, although it took a fair while for them to realise this. Yes, it's complex. Extremely complex, and as another poster noted, the learning curve is horrible and the documentation is extremely limited. Unfortunately there's no real substitute.
The mmap/madvise approach works well for things like Varnish Cache, where you have a flat collection of similar and largely unrelated objects. It does not work well for databases, where you have many different types of data, some of them interrelated, all wanting to be handled differently. If you can meet the performance needs of your product by doing what you're doing, then great -- that's a fantastic complexity saving for your business. But the claim that "you can design your system so the access pattern that the OS is optimized for matches your needs" is unfortunately not true. It might be good enough for what you need, but it's not optimal. That's why there are so many lines of code in other DB engines doing this the hard way.
The MongoDB developers were morons. They used mmap poorly and gained none of its potential advantages. Their incompetence and failures are not an indictment of mmap itself.
For as long as computers and databases have existed, there has been a war between DB designers and OS designers, with DB designers always claiming they have better knowledge of workloads than OS designers. That can only ever be true when the DB is the only process running on the machine. Whenever anything else is also running on the machine, that claim cannot possibly hold.
Reality today is that nobody runs on dedicated machines. Everyone runs on "the cloud" where all hardware is shared with an unknown and arbitrary number of other users.
The counterargument to this is that the kernel can make decisions based on nonlocal information about the system.
If your database server is the only process in the system that is using significant memory, then sure, you might as well manage it yourself. But if there are multiple processes competing for memory, the kernel is better equipped to decide which processes' pages should be paged out and which kept in memory.
Generally, for performance-critical use cases you dedicate the machine to the database. This simplifies many things (no need to reason about sharing, etc.).
This makes me wonder whether there would be value in an OS that is also a DBMS (or vice versa). In other words, if the DBMS has total control over the hardware, perhaps performance can be maximized without too much additional complexity.
That was back when hardware was changing to a significant degree, though. Nowadays, there ain't really much that's new about hardware today v. hardware from 10 or 20 years ago - hence operating systems / filesystems being able to remain mostly stable instead of suffering from the exact same problem.
madvise() and its relatives are a blunt and imprecise instrument. The kernel is free to ignore them, and frequently does in practice. Nor do they prevent the kernel from proactively doing things you don't want done to your buffer pool at the worst possible time.
A user space buffer pool gives you precise and deterministic control of many of these behaviors.
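A small illustration of how blunt the instrument is (the file name is a placeholder; which hints are honored, and when, is up to the kernel):

    /* madvise() requests are hints: MADV_RANDOM and MADV_WILLNEED ask the
     * kernel to change its readahead/prefetch behavior, but it may batch,
     * defer, or ignore them, and it can still evict these pages at will. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.db", O_RDONLY);   /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

        char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        madvise(map, st.st_size, MADV_RANDOM);    /* "skip readahead, please" */
        madvise(map, st.st_size, MADV_WILLNEED);  /* "prefetch this, please"  */

        /* Neither request is guaranteed. A user-space buffer pool that
         * issues explicit reads and evictions is deterministic instead. */
        munmap(map, st.st_size);
        close(fd);
        return 0;
    }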
"If you aren’t using mmap, on the other hand, you still need to handle of all those issues"
Which seems like a reasonable statement. Is it less work to make your own top-to-bottom buffer pool, and would that necessarily avoid similar issues? Or is it less work to use mmap(), but address the issues?
QuestDB's author here. I do share Ayende's sentiment. There are things that the OP paper doesn't mention which can help mitigate some of the disadvantages:
- single-threaded calls to fallocate() help avoid sparse files and the SIGBUS on memory writes that they can trigger (see the sketch after this comment)
- over-allocating, caching mapped addresses, and minimizing OS calls
- transactional safety can be implemented via a shared-memory model
- hugetlb can minimize TLB shootdowns
I personally have no regrets about using mmap, given all the benefits it provides.
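To illustrate the fallocate point from the list above (a sketch with placeholder names and sizes, not QuestDB code):

    /* Pre-allocating with fallocate() before mmap() means every page in the
     * mapping is backed by real blocks, so writing through the mapping
     * cannot SIGBUS on a surprise out-of-space condition: the allocation
     * fails up front instead. Linux-specific; sizes are illustrative. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAP_SIZE (16 * 1024 * 1024)  /* over-allocate to minimize remaps */

    int main(void) {
        int fd = open("data.db", O_RDWR | O_CREAT, 0644);  /* placeholder */
        if (fd < 0) { perror("open"); return 1; }

        /* Reserve the blocks now; an ENOSPC here beats a SIGBUS later. */
        if (fallocate(fd, 0, 0, MAP_SIZE) < 0) { perror("fallocate"); return 1; }

        char *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        memcpy(map, "hello", 5);  /* safe: the backing blocks already exist */

        munmap(map, MAP_SIZE);
        close(fd);
        return 0;
    }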
I suppose. Some problems with mmap() are a bit hard to fix from userland, though. You will hit contention on locks inside the kernel (mmap_sem) if the database does concurrent, high-throughput mmap()/munmap(). I don't follow Linux kernel development closely enough to know if this has improved recently, but it was easy to reproduce 4-5 years ago.
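A rough reproducer of that pattern (thread and iteration counts are arbitrary; this is a sketch, not a rigorous benchmark): every thread's map/unmap takes the same per-process lock, so adding threads doesn't scale.

    /* Many threads doing nothing but mmap()/munmap() serialize on the
     * kernel's per-mm lock (mmap_sem/mmap_lock). Build with -lpthread;
     * observe the serialization with a profiler of your choice. */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define THREADS 8
    #define ITERS   100000

    static void *hammer(void *arg) {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p != MAP_FAILED) munmap(p, 1 << 20);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[THREADS];
        for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, hammer, NULL);
        for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
        puts("done");
        return 0;
    }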
That makes sense. I wasn't going right to the conclusion that working around mmap() issues was easier, but it didn't seem to be explored much. Is the contention around having one file mmap()ed, or is it reduced if you use more files?
When I worked on/with BerkeleyDB in the late 90s we came to the conclusion that the various OS mmap() implementations had been tweaked/fixed to the point where they worked for the popular high profile applications (in those days: Oracle). So it can appear like everything is fine, but that probably means your code behaves the same way as <popular database du jour>.
Um... Oracle (and other enterprise databases like DB2) don't use mmap. They use Direct I/O. Oracle does have anonymous (non-file-backed) memory which is mmap'ed and shared across various Oracle processes, called the Shared Global Area (SGA), but it's not used for I/O.
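That SGA-style use of mmap is the anonymous, shared kind: memory visible across processes, with no file I/O going through the mapping. Roughly like this sketch (the 1 MiB size is arbitrary; not Oracle code):

    /* Anonymous shared memory: mmap() with no backing file, shared across
     * fork()ed processes. Useful for a shared buffer cache; unrelated to
     * doing file I/O through a mapping. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        char *shared = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (shared == MAP_FAILED) { perror("mmap"); return 1; }

        if (fork() == 0) {                       /* child writes... */
            strcpy(shared, "visible to parent");
            _exit(0);
        }
        wait(NULL);
        printf("%s\n", shared);                  /* ...parent reads */

        munmap(shared, 1 << 20);
        return 0;
    }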
Some issues with mmap() can be avoided entirely if you have your own buffer pool. Others are easier to handle because they are made explicit and more buffer state is exposed to the program logic. That's the positive side.
The downside is that writing an excellent buffer pool is not trivial, especially if you haven't done it before. There are many cross-cutting design concerns that have to be accounted for. In my experience, an excellent C++ implementation tends to be on the order of 2,000 lines of code -- someone has to write that. It also isn't simple code; the logic is relatively dense and subtle.
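For a sense of the shape (though not the size) of that code, here is a toy skeleton in C with invented names -- no latching, no hash table, no real I/O, no background writeback, which is exactly where the other ~1,900 lines go:

    /* Toy buffer-pool skeleton: shows only the core state a real pool
     * tracks -- frames, pin counts, dirty bits, and a clock hand. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE    4096
    #define POOL_FRAMES  64
    #define INVALID_PAGE UINT64_MAX

    typedef struct {
        uint64_t page_id;     /* which disk page this frame holds */
        int      pin_count;   /* >0 means in use, not evictable */
        bool     dirty;       /* must be written back before reuse */
        bool     referenced;  /* clock bit */
        uint8_t  data[PAGE_SIZE];
    } Frame;

    typedef struct {
        Frame  frames[POOL_FRAMES];
        size_t hand;          /* clock hand for eviction */
    } BufferPool;

    static void pool_init(BufferPool *bp) {
        memset(bp, 0, sizeof(*bp));
        for (size_t i = 0; i < POOL_FRAMES; i++)
            bp->frames[i].page_id = INVALID_PAGE;
    }

    /* Pick a victim with the clock algorithm; skip pinned frames. */
    static Frame *pool_evict(BufferPool *bp) {
        for (size_t scanned = 0; scanned < 2 * POOL_FRAMES; scanned++) {
            Frame *f = &bp->frames[bp->hand];
            bp->hand = (bp->hand + 1) % POOL_FRAMES;
            if (f->pin_count > 0) continue;
            if (f->referenced) { f->referenced = false; continue; }
            /* A real pool would flush f->data here if f->dirty. */
            return f;
        }
        return NULL;  /* everything pinned: caller must wait or fail */
    }

    /* Pin a page: return its frame, faulting it in on a miss. */
    static Frame *pool_pin(BufferPool *bp, uint64_t page_id) {
        for (size_t i = 0; i < POOL_FRAMES; i++) {  /* linear scan; real pools hash */
            Frame *f = &bp->frames[i];
            if (f->page_id == page_id) {
                f->pin_count++;
                f->referenced = true;
                return f;
            }
        }
        Frame *f = pool_evict(bp);
        if (!f) return NULL;
        f->page_id = page_id;
        f->pin_count = 1;
        f->dirty = false;
        f->referenced = true;
        /* A real pool would read the page from disk into f->data here. */
        return f;
    }

    static void pool_unpin(Frame *f, bool dirtied) {
        if (dirtied) f->dirty = true;
        if (f->pin_count > 0) f->pin_count--;
    }

    int main(void) {
        static BufferPool bp;           /* static: the pool is large */
        pool_init(&bp);
        Frame *f = pool_pin(&bp, 42);   /* pin page 42, modify, unpin dirty */
        f->data[0] = 0xAB;
        pool_unpin(f, true);
        return 0;
    }

Even this toy already forces the questions a real pool must answer: who flushes dirty victims, how lookups scale past a linear scan, and what happens when everything is pinned.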
Thank you for these counter-arguments. It's good to have them in order to make up your own mind, especially when recognized experts adopt a mocking "you wouldn't dare think otherwise" tone.