The pragmatic consideration that usually influences the decision to use mmap() i...

tmountain · on Jan 14, 2022

The original version of MongoDB used mmap, and I worked at a company that had a ton of issues with cache warmup and the cache getting trashed by competing processes. Granted this was a long time ago, but the main issue was the operating system's willingness to reallocate large swaths of memory from the address space to whatever process was asking for memory right now.

Once the working set got trashed, performance would go through the floor, and our app would slow to a crawl while the cache went through the warmup cycle.

Long story short, with that model, Mongo couldn't "own" the memory it was using, and this lead to chronic problems. Wiredtiger fixed this completely, but I still think this is a cautionary tale for anyone considering building a DB without a dedicated memory manager.

hinkley · on Jan 15, 2022

The original sales pitch I heard for slab alocators was: use the standard libraries for general workloads, but if you know your data better than the stdlib, you might be able to do better.

mmap access patterns seem like something where you can do better. Especially in the age of io_uring, when an n+1 pointer chasing situation doesn't particularly care what order the results are processed as long as the last one shows up in a reasonable amount of time.

bogomipz · on Jan 15, 2022

Perhaps I misread your first sentence but was MongoDB related to your cache warming issue? Or were these two distinct issues related to mmap-based data stores?

dicroce · on Jan 14, 2022

I have written a couple of mmap() based time series databases. In my case, these were databases for holding video. For my uses, mmap() has been great. I strongly agree with your comment. Maybe mmap() isn't the greatest, but it has worked for me.

stingraycharles · on Jan 14, 2022

When you say “replacing mmap()”, could you elaborate a bit on it? The way you write it sounds like you’re describing a reimplementation of mmap() with the same API, while I believe the actual goal would be to completely rewrite the persistence and caching layer to be like a “real” database.

jandrewrogers · on Jan 14, 2022

The implementation is essentially a complete replacement for the kernel page cache and I/O scheduler, much of the behavior of which is hidden behind the mmap() functions. It is never a drop-in replacement and you wouldn't want it to be but it is functionally quite similar.

For example, while the storage will usually have a linear address space, the "pointer" to that address space won't be a literal pointer even though it may behave much like one. There may be stricter invariants around write back and page fault behavior, and madvise()-like calls have deterministic effects. You often have cheap visibility into details of page/buffer state that you don't get with mmap() that can be used in program logic. And so on. Different but similar.

lmilcin · on Jan 14, 2022

The task is deceivingly simple.

You have a file and a bunch of memory and you need to make sure data is being moved from file to memory when needed and from memory to file when needed.

mmap() is one algorithm to do it, and the idea is that it is not necessarily the best one.

Knowing more about your data and application and needs should theoretically enable you to design an algorithm that will be more efficient at moving data back and forth.

idiocrat · on Jan 15, 2022

> most software engineers

Those could already use available high-level DBs or libraries, rather than building own.

I guess if somebody decides to building a new market-grade database system from scratch, they should hire experienced IO specialists and perhaps also lawyers, as the cache eviction algos are patented.