You don't want the OS to take care of reading from disk and of page caching/eviction. You want the DB itself to have explicit control over those, because the DB has information about access patterns and table formats that the OS does not. It is better equipped than the OS to anticipate which portions of tables and indices need to be cached in memory. It is better equipped to calculate when/where/what/how much to prefetch from disk. It is better equipped to determine when to buffer writes and when to flush to disk.
Sure, it's more work than using mmap. But it's also more correct, it forces you to handle edge cases explicitly, and it's much more amenable to platform-specific improvements à la kqueue/io_uring.
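To make the io_uring point concrete, here's a minimal sketch of an explicit asynchronous read through liburing (a hedged illustration, not anything from a real engine; "data.db", the queue depth, and the 4 KiB read are placeholders). The point is that the engine decides exactly which page to fetch and when, instead of hoping a page fault does the right thing:

    /* Explicit read via liburing: the DB, not the page-fault handler,
     * chooses what to fetch and when. Build with -luring. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <liburing.h>

    int main(void) {
        struct io_uring ring;
        int ret = io_uring_queue_init(8, &ring, 0);   /* queue depth 8: arbitrary */
        if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); return 1; }

        int fd = open("data.db", O_RDONLY);           /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);  /* page 0: our choice */
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %d bytes\n", cqe->res);          /* res < 0 means -errno */
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }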
OTOH, if you care about that last 5 percent or so of performance, there is the added complexity that what the OS has optimized for can differ between OSes (e.g., macOS, Linux, FreeBSD), can change between versions of Linux, and, in the case of buffered writeback, can even differ between filesystems on the same version of Linux. This is probably, historically, one of the most important reasons why enterprise databases like Oracle DB, DB2, etc., have used direct I/O, and not buffered I/O or mmap.
Speaking as an OS developer, we're not going to try to optimize buffered I/O for a particular database. We'll be using benchmarks like compilebench and postmark to optimize our I/O, and if your write patterns, or readahead patterns, or caching requirements, don't match those workloads, well... sucks to be you.
I'll also point out that the big companies that actually pay the salaries of us file system developers (e.g., Oracle, Google, etc.) for the most part use direct I/O for our performance-critical workloads. If database companies that want to use mmap want to hire file system developers and contribute benchmarks and performance patches for ext4, xfs, etc., then, speaking as the ext4 maintainer, I'll welcome that, and we have a weekly video conference where I'd love to have your engineers join to discuss your contributions. :-)
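For reference, direct I/O at the syscall level looks roughly like the sketch below (an illustration, not vendor code; the 4096-byte alignment is an assumption here, real code queries the device/filesystem for the required alignment):

    /* O_DIRECT bypasses the page cache entirely, so the buffer, the
     * offset, and the length must all be block-aligned. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.db", O_RDONLY | O_DIRECT);   /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;  /* aligned buffer */

        ssize_t n = pread(fd, buf, 4096, 0);  /* aligned length and offset */
        if (n < 0) perror("pread");
        else printf("read %zd bytes, page cache untouched\n", n);

        free(buf);
        close(fd);
        return 0;
    }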
The MongoDB developers once thought as you do. They were wrong, although it took a fair while for them to realise this. Yes, it's complex. Extremely complex, and as another poster noted, the learning curve is horrible and the documentation is extremely limited. Unfortunately there's no real substitute.
The mmap/madvise approach works well for things like Varnish Cache, where you have a flat collection of similar and largely unrelated objects. It does not work well for databases, where you have many different types of data, some of them interrelated, all wanting to be handled differently. If you can meet the performance needs of your product by doing what you're doing, then great -- that's a fantastic complexity saving for your business. But the claim that "you can design your system so the access pattern that the OS is optimized for matches your needs" is unfortunately not true. It might be good enough for what you need, but it's not optimal. That's why there are so many lines of code in other DB engines doing this the hard way.
The MongoDB developers were morons. They used mmap poorly and gained none of its potential advantages. Their incompetence and failures are not an indictment of mmap itself.
For as long as computers and databases have existed, there has been a war between DB designers and OS designers, with DB designers always claiming they have better knowledge of workloads than OS designers. That can only ever be true when the DB is the only process running on the machine. Whenever anything else is also running on the machine, that claim cannot possibly hold.
Reality today is that nobody runs on dedicated machines. Everyone runs on "the cloud" where all hardware is shared with an unknown and arbitrary number of other users.
The counterargument to this is that the kernel can make decisions based on nonlocal information about the system.
If your database server is the only process in the system that is using significant memory, then sure, you might as well manage it yourself. But if there are multiple processes competing for memory, the kernel is better equipped to decide which processes' pages should be paged out and which kept in memory.
Generally, for performance-critical use cases you dedicate the machine to the database. This simplifies many things (no need to reason about sharing, etc.).
This makes me wonder whether there would be value in an OS that is also a DBMS (or vice versa). In other words, if the DBMS has total control over the hardware, perhaps performance can be maximized without too much additional complexity.
That was back when hardware was changing to a significant degree, though. Nowadays, there ain't really much that's new about hardware today v. hardware from 10 or 20 years ago - hence operating systems / filesystems being able to remain mostly stable instead of suffering from the exact same problem.
madvise() and its relatives are a blunt and imprecise instrument. The kernel is free to ignore them, and frequently does in practice. Nor do they prevent the kernel from proactively doing things you don't want done to your buffer pool at the worst possible time.
A user space buffer pool gives you precise and deterministic control of many of these behaviors.
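A small illustration of how blunt the instrument is (the file name is a placeholder; which hints are honored, and when, is up to the kernel):

    /* madvise() requests are hints: MADV_RANDOM and MADV_WILLNEED ask the
     * kernel to change its readahead/prefetch behavior, but it may batch,
     * defer, or ignore them, and it can still evict these pages at will. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.db", O_RDONLY);   /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

        char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        madvise(map, st.st_size, MADV_RANDOM);    /* "skip readahead, please" */
        madvise(map, st.st_size, MADV_WILLNEED);  /* "prefetch this, please"  */

        /* Neither request is guaranteed. A user-space buffer pool that
         * issues explicit reads and evictions is deterministic instead. */
        munmap(map, st.st_size);
        close(fd);
        return 0;
    }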
"If you aren’t using mmap, on the other hand, you still need to handle of all those issues"
Which seems like a reasonable statement. Is it less work to make your own top-to-bottom buffer pool, and would that necessarily avoid similar issues? Or is it less work to use mmap(), but address the issues?
QuestDB's author here. I do share Ayende's sentiment. There are things that the OP paper doesn't mention which can help mitigate some of the disadvantages:
- single-threaded calls to fallocate() help avoid sparse files and the SIGBUS on memory writes that they can trigger (see the sketch after this comment)
- over-allocating, caching mapped addresses, and minimizing OS calls
- transactional safety can be implemented via a shared-memory model
- hugetlb can minimize TLB shootdowns
I personally have no regrets about using mmap, given all the benefits it provides.
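To illustrate the fallocate point from the list above (a sketch with placeholder names and sizes, not QuestDB code):

    /* Pre-allocating with fallocate() before mmap() means every page in the
     * mapping is backed by real blocks, so writing through the mapping
     * cannot SIGBUS on a surprise out-of-space condition: the allocation
     * fails up front instead. Linux-specific; sizes are illustrative. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAP_SIZE (16 * 1024 * 1024)  /* over-allocate to minimize remaps */

    int main(void) {
        int fd = open("data.db", O_RDWR | O_CREAT, 0644);  /* placeholder */
        if (fd < 0) { perror("open"); return 1; }

        /* Reserve the blocks now; an ENOSPC here beats a SIGBUS later. */
        if (fallocate(fd, 0, 0, MAP_SIZE) < 0) { perror("fallocate"); return 1; }

        char *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        memcpy(map, "hello", 5);  /* safe: the backing blocks already exist */

        munmap(map, MAP_SIZE);
        close(fd);
        return 0;
    }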
I suppose. Some problems with mmap() are a bit hard to fix from userland, though. You will hit contention on locks inside the kernel (mmap_sem) if the database does concurrent, high-throughput mmap()/munmap(). I don't follow Linux kernel development closely enough to know if this has improved recently, but it was easy to reproduce 4-5 years ago.
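A rough reproducer of that pattern (thread and iteration counts are arbitrary; this is a sketch, not a rigorous benchmark): every thread's map/unmap takes the same per-process lock, so adding threads doesn't scale.

    /* Many threads doing nothing but mmap()/munmap() serialize on the
     * kernel's per-mm lock (mmap_sem/mmap_lock). Build with -lpthread;
     * observe the serialization with a profiler of your choice. */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define THREADS 8
    #define ITERS   100000

    static void *hammer(void *arg) {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p != MAP_FAILED) munmap(p, 1 << 20);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[THREADS];
        for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, hammer, NULL);
        for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
        puts("done");
        return 0;
    }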
That makes sense. I wasn't going right to the conclusion that working around mmap() issues was easier, but it didn't seem to be explored much. Is the contention around having one file mmap()ed, or is it reduced if you use more files?
When I worked on/with BerkeleyDB in the late 90s we came to the conclusion that the various OS mmap() implementations had been tweaked/fixed to the point where they worked for the popular high profile applications (in those days: Oracle). So it can appear like everything is fine, but that probably means your code behaves the same way as <popular database du jour>.
Um... Oracle (and other enterprise databases like DB2) don't use mmap. They use Direct I/O. Oracle does have anonymous (non-file-backed) memory which is mmap'ed and shared across various Oracle processes, called the Shared Global Area (SGA), but it's not used for I/O.
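That SGA-style use of mmap is the anonymous, shared kind: memory visible across processes, with no file I/O going through the mapping. Roughly like this sketch (the 1 MiB size is arbitrary; not Oracle code):

    /* Anonymous shared memory: mmap() with no backing file, shared across
     * fork()ed processes. Useful for a shared buffer cache; unrelated to
     * doing file I/O through a mapping. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        char *shared = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (shared == MAP_FAILED) { perror("mmap"); return 1; }

        if (fork() == 0) {                       /* child writes... */
            strcpy(shared, "visible to parent");
            _exit(0);
        }
        wait(NULL);
        printf("%s\n", shared);                  /* ...parent reads */

        munmap(shared, 1 << 20);
        return 0;
    }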
Some issues with mmap() can be avoided entirely if you have your own buffer pool. Others are easier to handle because they are made explicit and more buffer state is exposed to the program logic. That's the positive side.
The downside is that writing an excellent buffer pool is not trivial, especially if you haven't done it before. There are many cross-cutting design concerns that have to be accounted for. In my experience, an excellent C++ implementation tends to be on the order of 2,000 lines of code -- someone has to write that. It also isn't simple code; the logic is relatively dense and subtle.
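For a sense of the shape (though not the size) of that code, here is a toy skeleton in C with invented names -- no latching, no hash table, no real I/O, no background writeback, which is exactly where the other ~1,900 lines go:

    /* Toy buffer-pool skeleton: shows only the core state a real pool
     * tracks -- frames, pin counts, dirty bits, and a clock hand. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE    4096
    #define POOL_FRAMES  64
    #define INVALID_PAGE UINT64_MAX

    typedef struct {
        uint64_t page_id;     /* which disk page this frame holds */
        int      pin_count;   /* >0 means in use, not evictable */
        bool     dirty;       /* must be written back before reuse */
        bool     referenced;  /* clock bit */
        uint8_t  data[PAGE_SIZE];
    } Frame;

    typedef struct {
        Frame  frames[POOL_FRAMES];
        size_t hand;          /* clock hand for eviction */
    } BufferPool;

    static void pool_init(BufferPool *bp) {
        memset(bp, 0, sizeof(*bp));
        for (size_t i = 0; i < POOL_FRAMES; i++)
            bp->frames[i].page_id = INVALID_PAGE;
    }

    /* Pick a victim with the clock algorithm; skip pinned frames. */
    static Frame *pool_evict(BufferPool *bp) {
        for (size_t scanned = 0; scanned < 2 * POOL_FRAMES; scanned++) {
            Frame *f = &bp->frames[bp->hand];
            bp->hand = (bp->hand + 1) % POOL_FRAMES;
            if (f->pin_count > 0) continue;
            if (f->referenced) { f->referenced = false; continue; }
            /* A real pool would flush f->data here if f->dirty. */
            return f;
        }
        return NULL;  /* everything pinned: caller must wait or fail */
    }

    /* Pin a page: return its frame, faulting it in on a miss. */
    static Frame *pool_pin(BufferPool *bp, uint64_t page_id) {
        for (size_t i = 0; i < POOL_FRAMES; i++) {  /* linear scan; real pools hash */
            Frame *f = &bp->frames[i];
            if (f->page_id == page_id) {
                f->pin_count++;
                f->referenced = true;
                return f;
            }
        }
        Frame *f = pool_evict(bp);
        if (!f) return NULL;
        f->page_id = page_id;
        f->pin_count = 1;
        f->dirty = false;
        f->referenced = true;
        /* A real pool would read the page from disk into f->data here. */
        return f;
    }

    static void pool_unpin(Frame *f, bool dirtied) {
        if (dirtied) f->dirty = true;
        if (f->pin_count > 0) f->pin_count--;
    }

    int main(void) {
        static BufferPool bp;           /* static: the pool is large */
        pool_init(&bp);
        Frame *f = pool_pin(&bp, 42);   /* pin page 42, modify, unpin dirty */
        f->data[0] = 0xAB;
        pool_unpin(f, true);
        return 0;
    }

Even this toy already forces the questions a real pool must answer: who flushes dirty victims, how lookups scale past a linear scan, and what happens when everything is pinned.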
Thank you for these counter-arguments. It's good to have them in order to make up your own mind, especially when recognized experts adopt a mocking "you wouldn't dare think otherwise" tone.