I'm the author (well, one of) RavenDB You are correct to an extent, but there ar...

tytso · on Jan 14, 2022

OTOH, if you care about that last 5 percent or so of performance there is the complexity that what the OS has optimized for might be different between different OS's (e.g., MacOS, Linux, FreeBSD, etc.) and indeed, might change between different versions of Linux, or even, in the case of buffered writeback, between different filesystems on the same version of Linux. This is probably historically one of the most important reasons why enterprise databases like Oracle DB, DB2, etc., have used direct I/O, and not buffered I/O or mmap.

Speaking as an OS developer, we're not going to try to optimize buffered I/O for a particular database. We'll be using benchmarks like compilebench and postmark to optimize our I/O, and if your write patterns, or readahead patterns, or caching requirements, don't match those workloads, well.... sucks to be you.

I'll also point out that those big companies that actually pay the salaries of us file system developers (e.g., Oracle, Google, etc.) for the most part use Direct I/O for our performance critical workloads. If database companies that want to use mmap want to hire file system developers and contribute benchmarks and performance patches for ext4, xfs, etc., speaking as the ext4 maintainer, I'll welcome that, and we do have a weekly video conference where I'd love to have your engineers join to discuss your contributions. :-)

ayende · on Jan 14, 2022

The key from my perspective is that I CAN design my access patterns to match what you'll optimize for.

Another aspect to remember is that mmap being even possible for databases as the primary mechanism is quite new.

Go 15 years ago and you are in 32-bit land. That rules out mmap as your approach.

At this point, I might as well skip the OS and go direct IO.
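The 32-bit constraint is plain arithmetic: a 32-bit pointer can address at most 2^32 bytes (4 GiB), and that is an upper bound before the kernel, code, heap, and stacks take their share. A rough sketch of the check, with hypothetical helper names chosen only for illustration:

```python
GIB = 1024 ** 3


def address_space_bytes(pointer_bits: int) -> int:
    """Theoretical size of the virtual address space for a given pointer width."""
    return 2 ** pointer_bits


def can_map_whole_file(file_size: int, pointer_bits: int) -> bool:
    """Upper-bound check only: a real process has less usable address space
    than this, so a True result here is necessary but not sufficient."""
    return file_size < address_space_bytes(pointer_bits)


# A 10 GiB database file (an arbitrary example size) on 32-bit vs 64-bit pointers.
fits_32 = can_map_whole_file(10 * GIB, 32)
fits_64 = can_map_whole_file(10 * GIB, 64)
```

Under these assumptions a 10 GiB file cannot be mapped whole with 32-bit pointers, so the engine would need its own windowing scheme on top of mmap, which is a large part of the complexity mmap was supposed to avoid.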

As for differing OS behavior, I generally find that they all roughly optimize for the same thing.

I need best perf on Linux and Windows. On other systems I can get away with just being pretty good.

jfindley · on Jan 15, 2022

The mongodb developers once thought as you did. They were wrong, although it took a fair while for them to realise this. Yes it's complex. Extremely complex, and as another poster noted, the learning curve is horrible and documentation is extremely limited. Unfortunately there's no real substitute.

The mmap/madvise approach works well for things like varnish cache, where you have a flat collection of similar and largely unrelated objects. It does not work well for databases where you have many different types of data, some of which are interrelated, and all want to be handled differently. If you can meet the performance needs for your product by doing what you're doing then great - that's a fantastic complexity saving for your business. But the claim that "you can design your system so the access pattern that the OS is optimized for matches your needs" is unfortunately not true. It might be good enough for what you need, but it's not optimal. That's why there's so many lines of code in other DB engines doing this the hard way.
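For readers unfamiliar with the mmap/madvise approach, here is a minimal sketch in Python. It is not taken from Varnish or from any database engine. It maps a file of fixed-size records read-only, hints random access with madvise where the platform offers it, and copies one record out. `RECORD_SIZE`, `read_record`, and the file layout are hypothetical, chosen only for illustration.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 64  # hypothetical fixed record size, in bytes


def read_record(path: str, index: int) -> bytes:
    """Map the whole file read-only and return a copy of one record."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # madvise is only exposed on some platforms (Python 3.8+),
            # and the hint is advisory: the kernel is free to ignore it.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_RANDOM"):
                m.madvise(mmap.MADV_RANDOM)
            offset = index * RECORD_SIZE
            # Slicing copies the bytes out, faulting pages in as needed.
            return m[offset:offset + RECORD_SIZE]


# Usage: write eight records, each filled with its own index, then read record 3.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "records.bin")
    with open(path, "wb") as f:
        for i in range(8):
            f.write(bytes([i]) * RECORD_SIZE)
    record = read_record(path, 3)
```

All caching, readahead, and eviction decisions here are left to the kernel; the application can only hint. That is the trade-off being argued over: it works when every object wants the same treatment, and it gives little control when different kinds of data need different handling.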

hyc_symas · on Jan 15, 2022

The MongoDB developers were morons. They used mmap poorly and gained none of its potential advantages. Their incompetence and failures are not an indictment against using mmap.

For as long as computers and databases have existed, there has been a war between DB designers and OS designers, with DB designers always claiming they have better knowledge of workloads than OS designers. That can only ever possibly be true when the DB is the only process running on the machine. Whenever anything else is also running on the machine that claim can not possibly be true.

Reality today is that nobody runs on dedicated machines. Everyone runs on "the cloud" where all hardware is shared with an unknown and arbitrary number of other users.