You don't *want* the OS to take care of reading from disk and page caching/evict...

ayende · on Jan 14, 2022

I'm the author (well, one of) RavenDB

You are correct to an extent, but there are a few things yo noted.

* you can design your system so the access pattern that the OS is optimized for matches your needs

* you can use madvise() to give some useful hints

* the amount of complexity you don't have to deal with is staggering

tytso · on Jan 14, 2022

OTOH, if you care about that last 5 percent or so of performance there is the complexity that what the OS has optimized for might be different between different OS's (e.g., MacOS, Linux, FreeBSd, etc.) and indeed, might change between different versions of Linux, or even, in the case of buffered writeback, between different filesystems on the same version of Linux. This is probably historically one of the most important reasons why enterprise databases like Oracle DB, DB2, etc., have used direct I/O, and not buffered I/O or mmap.

Speaking as an OS developer, we're not going to try to optimize buffered I/O for a particular database. We'll be using becnhmarks like compilebench and postmark to optimize our I/O, and if your write patterns, or readahead patterns, or caching requirements, don't match those workloads, well.... sucks to be you.

I'll also point out that those big companies that actually pay the salarise of us file system developers (e.g., Oracle, Google, etc.) for the most part use Direct I/O for our performance critical workloads. If database companies that want to use mmap want to hire file system developers and contribute benchmarks and performance patches for ext4, xfs, etc., speaking as the ext4 maintainer, I'll welcome that, and we do have a weekly video conference where I'd love to have your engineers join to discuss your contributions. :-)

ayende · on Jan 14, 2022

The key from my perspective is that I CAN design my access patterns to match what you'll optimized.

Another aspect to remember is that mmap being even possible for databases as the primary mechanism is quite new.

Go 15 years ago and you are in 32 bit land. That rule out mmap as your approach.

At this point, I might as well skip the OS and go direct IO.

As for differ OS behavior, I generally find that they all roughly optimize for the same thing.

I need best perf on Linux and Windows. Other systems I can get away with just being pretty good

jfindley · on Jan 15, 2022

The mongodb developers once thought as you did. They were wrong, although it took a fair while for them to realise this. Yes it's complex. Extremely complex, and as another poster noted, the learning curve is horrible and documentation is extremely limited. Unfortunately there's no real substitute.

The mmap/madvise approach works well for things like varnish cache, where you have a flat collection of similar and largely unrelated objects. It does not work well for databases where you have many different types of data, some of which are interrelated, and all want to be handled differently. If you can meet the performance needs for your product by doing what you're doing then great - that's a fantastic complexity saving for your business. But the claim that "you can design your system so the access pattern that the OS is optimized for matches your needs" is unfortunately not true. It might be good enough for what you need, but it's not optimal. That's why there's so many lines of code in other DB engines doing this the hard way.

hyc_symas · on Jan 15, 2022

The MongoDB developers were morons. They used mmap poorly and gained none of its potential advantages. Their incompetence and failures are not an indictment against using mmap.

For as long as computers and databases have existed, there has been a war between DB designers and OS designers, with DB designers always claiming they have better knowledge of workloads than OS designers. That can only ever possibly be true when the DB is the only process running on the machine. Whenever anything else is also running on the machine that claim can not possibly be true.

Reality today is that nobody runs on dedicated machines. Everyone runs on "the cloud" where all hardware is shared with an unknown and arbitrary number of other users.

sltkr · on Jan 14, 2022

The counterargument to this is that the kernel can make decisions based on nonlocal information about the system.

If your database server is the only process in the system that is using significant memory, then sure, you might as well manage it yourself. But if there are multiple processes competing for memory, the kernel is better equipped to decide which processes' pages should be paged out or kept into memory.

electricshampo1 · on Jan 14, 2022

Generally for perf critical use cases you dedicate the machine to the database. This simplifies many things (avoiding having to reason about sharing, etc etc).

munchler · on Jan 14, 2022

This makes me wonder whether there would be value in an OS that is also a DBMS (or vice versa). In other words, if the DBMS has total control over the hardware, perhaps performance can be maximized without too much additional complexity.

matt_d · on Jan 15, 2022

One example is DBOS: A Database-oriented Operating System, https://dbos-project.github.io/ / https://github.com/DBOS-project (more details under "Publications").

sedachv · on Jan 14, 2022

This is a bad idea from the 1960s: IBM TPF, MUMPS, Pick. As soon as the hardware changes it becomes slower and more complicated.

yellowapple · on Jan 15, 2022

That was back when hardware was changing to a significant degree, though. Nowadays, there ain't really much that's new about hardware today v. hardware from 10 or 20 years ago - hence operating systems / filesystems being able to remain mostly stable instead of suffering from the exact same problem.

matheusmoreira · on Jan 14, 2022

> the DB has information on access patterns and table format that the OS is not aware of

Aren't system calls such as madvise supposed to allow user space to let the kernel know precisely that information?

jandrewrogers · on Jan 14, 2022

The madvise() functions and similar are a blunt and imprecise instrument. The kernel is free to ignore them, and frequently does in practice. It also does not prevent the kernel from proactively doing things you don't want it to do with your buffer pool at the worst possible time.

A user space buffer pool gives you precise and deterministic control of many of these behaviors.

ayende · on Jan 14, 2022

That is an interesting statement, when discussing what you want to do

In this case, there is the issue of who is the you on question

For the database in isolation, maybe not ideal

For a system whre db and app run on the same machine? The OS can make sure you are on friendly terms and not fighting

Same for trying to SSH to a bust server and the OS can balance things out

masklinn · on Jan 14, 2022

> precisely

Madvise is discussed in the paper, and it notes specifically that:

* madvise is not precise

* madvise is... an advice, which the system is completely free to disregard

* madvise is error-prone, providing the wrong hint can have dire consequences