
The pragmatic consideration that usually influences the decision to use mmap() is the large discontinuity in skill and expertise required to replace it. Writing your own alternative to mmap() can be significantly superior in terms of performance and functionality, and often lends itself to a cleaner database architecture. However, this presumes a sufficiently sophisticated design for an mmap() replacement. The learning curve is steep and the critical nuances of sophisticated and practical designs are poorly explored in readily available literature, providing little in the way of "how-to" guides that you can lean on.

As a consequence, early attempts to replace mmap() are often quite poor. You don't know what you don't know, and details of the implementation that are often glossed over turn out to be critical in practice. For example, most people eventually figure out that LRU cache replacement is a bad idea, but many of the academic alternatives cause CPU cache thrashing in real systems, replacing one problem with another. There are clever and non-obvious design elements that can greatly mitigate this but they are treated as implementation details in most discussions of cache replacement and largely not discoverable if you are writing one for the first time.
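To make "cache replacement" concrete, here is a minimal sketch of CLOCK (second-chance), a well-known textbook approximation of LRU. It is an illustrative choice, not one of the non-obvious designs referred to above, and the `ClockCache` name and its methods are hypothetical. The point it shows is that a hit only sets a bit rather than relinking a list as classic LRU does.

```python
class ClockCache:
    """Fixed-size cache using CLOCK (second-chance) replacement. Illustrative only."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.slots = []   # each slot is [key, value, referenced_bit]
        self.index = {}   # key -> slot position
        self.hand = 0     # next slot the clock hand will examine

    def get(self, key):
        pos = self.index.get(key)
        if pos is None:
            return None
        self.slots[pos][2] = True  # a hit only sets a bit; nothing is reordered
        return self.slots[pos][1]

    def put(self, key, value):
        """Insert or update a key. Returns the evicted key, or None."""
        pos = self.index.get(key)
        if pos is not None:
            self.slots[pos][1] = value
            self.slots[pos][2] = True
            return None
        if len(self.slots) < self.capacity:
            self.index[key] = len(self.slots)
            self.slots.append([key, value, False])
            return None
        # Sweep: clear referenced bits until an unreferenced slot is found.
        while self.slots[self.hand][2]:
            self.slots[self.hand][2] = False
            self.hand = (self.hand + 1) % self.capacity
        victim = self.slots[self.hand][0]
        del self.index[victim]
        self.slots[self.hand] = [key, value, False]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return victim
```

With a capacity of 2, inserting "a" and "b", reading "a", then inserting "c" evicts "b", because "a" had its referenced bit set and got a second chance.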

While mmap() is a mediocre facility for a database, I think we also have to be cognizant that replacing it competently is not a trivial ask for most software engineers. If their learning curve is anything like mine, it runs from mmap() to designing obvious alternatives with many poorly handled edge cases, and eventually to figuring out how to design non-obvious alternatives that can smoothly handle very diverse workloads. That period of "poor alternatives" in the middle doesn't produce great databases but it almost feels necessary to properly grok the design problem. Most people would rather spend their time working on other parts of a database.




The original version of MongoDB used mmap, and I worked at a company that had a ton of issues with cache warmup and the cache getting trashed by competing processes. Granted this was a long time ago, but the main issue was the operating system's willingness to reallocate large swaths of memory from the address space to whatever process was asking for memory right now.

Once the working set got trashed, performance would go through the floor, and our app would slow to a crawl while the cache went through the warmup cycle.

Long story short, with that model, Mongo couldn't "own" the memory it was using, and this led to chronic problems. WiredTiger fixed this completely, but I still think this is a cautionary tale for anyone considering building a DB without a dedicated memory manager.


The original sales pitch I heard for slab allocators was: use the standard libraries for general workloads, but if you know your data better than the stdlib, you might be able to do better.

mmap access patterns seem like something where you can do better. Especially in the age of io_uring, when an n+1 pointer chasing situation doesn't particularly care what order the results are processed as long as the last one shows up in a reasonable amount of time.
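A rough sketch of the "order doesn't matter" idea. Python's standard library has no io_uring binding, so this uses a thread pool and `os.pread` (Unix only) as a stand-in; the `read_records` helper and its parameters are hypothetical. All n follow-up reads are issued up front and consumed in whatever order they complete.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def read_records(fd, offsets, size, workers=8):
    """Issue every read at once and collect results in completion order.

    A thread pool stands in for io_uring here; this is not an io_uring example.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each in-flight read back to the offset it was issued for.
        pending = {pool.submit(os.pread, fd, size, off): off for off in offsets}
        for future in as_completed(pending):
            results[pending[future]] = future.result()
    return results
```

The caller gets a dict keyed by offset, so total latency is bounded by the slowest read rather than the sum of all of them.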


Perhaps I misread your first sentence but was MongoDB related to your cache warming issue? Or were these two distinct issues related to mmap-based data stores?


I have written a couple of mmap() based time series databases. In my case, these were databases for holding video. For my uses, mmap() has been great. I strongly agree with your comment. Maybe mmap() isn't the greatest, but it has worked for me.


When you say “replacing mmap()”, could you elaborate a bit on it? The way you write it sounds like you’re describing a reimplementation of mmap() with the same API, while I believe the actual goal would be to completely rewrite the persistence and caching layer to be like a “real” database.


The implementation is essentially a complete replacement for the kernel page cache and I/O scheduler, much of the behavior of which is hidden behind the mmap() functions. It is never a drop-in replacement and you wouldn't want it to be but it is functionally quite similar.

For example, while the storage will usually have a linear address space, the "pointer" to that address space won't be a literal pointer even though it may behave much like one. There may be stricter invariants around write back and page fault behavior, and madvise()-like calls have deterministic effects. You often have cheap visibility into details of page/buffer state that you don't get with mmap() that can be used in program logic. And so on. Different but similar.
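As a toy illustration of those properties (assuming a 4096-byte page and a single backing file; the `BufferPool` class and every method on it are hypothetical, not any real database's API): pages are addressed by an integer ID rather than a raw pointer, residency and dirty state can be queried cheaply, and the eviction call either happens now or reports that it could not.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class BufferPool:
    """Toy user-space page cache over one file. Illustrative only."""

    def __init__(self, path):
        self.file = open(path, "r+b")
        self.frames = {}    # page_id -> bytearray with the page contents
        self.dirty = set()  # page_ids modified since the last write-back
        self.pins = {}      # page_id -> pin count

    def pin(self, page_id):
        """Make the page resident (an explicit 'page fault') and return its buffer."""
        if page_id not in self.frames:
            self.file.seek(page_id * PAGE_SIZE)
            data = self.file.read(PAGE_SIZE)
            self.frames[page_id] = bytearray(data.ljust(PAGE_SIZE, b"\0"))
        self.pins[page_id] = self.pins.get(page_id, 0) + 1
        return self.frames[page_id]

    def unpin(self, page_id, modified=False):
        if self.pins.get(page_id, 0) <= 0:
            raise ValueError("page is not pinned")
        self.pins[page_id] -= 1
        if modified:
            self.dirty.add(page_id)

    def is_resident(self, page_id):
        return page_id in self.frames

    def is_dirty(self, page_id):
        return page_id in self.dirty

    def write_back(self, page_id):
        """Write a dirty page to the file; the caller decides when this happens."""
        if page_id in self.dirty:
            self.file.seek(page_id * PAGE_SIZE)
            self.file.write(self.frames[page_id])
            self.file.flush()
            self.dirty.discard(page_id)

    def evict(self, page_id):
        """madvise()-like call with a deterministic result: True if evicted now."""
        if page_id not in self.frames or self.pins.get(page_id, 0) > 0:
            return False
        self.write_back(page_id)
        del self.frames[page_id]
        return True

    def close(self):
        for page_id in list(self.dirty):
            self.write_back(page_id)
        self.file.close()
```

Program logic can branch on `is_resident()` or `is_dirty()` directly, which is the kind of cheap visibility into page state that a plain mapping does not expose.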


The task is deceptively simple.

You have a file and a bunch of memory and you need to make sure data is being moved from file to memory when needed and from memory to file when needed.

mmap() is one algorithm to do it, and the idea is that it is not necessarily the best one.
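For reference, this is roughly what the mmap() route looks like from application code, using Python's standard `mmap` module. The `bump_counter` function and the 8-byte counter layout are made up for illustration. Note how little the program controls: the kernel decides when data is paged in and, apart from the explicit flush, when it is written back.

```python
import mmap


def bump_counter(path):
    """Increment an 8-byte little-endian counter stored at the start of a file."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0) as view:
            value = int.from_bytes(view[:8], "little") + 1  # read may fault the page in
            view[:8] = value.to_bytes(8, "little")          # store dirties the page
            view.flush()                                     # ask for write-back now
            return value
```

Everything between the load and the flush is left to the kernel's page cache policy, which is what a hand-written replacement takes over.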

Knowing more about your data and application and needs should theoretically enable you to design an algorithm that will be more efficient at moving data back and forth.


> most software engineers

Those could already use available high-level DBs or libraries, rather than building their own.

I guess if somebody decides to build a new market-grade database system from scratch, they should hire experienced IO specialists and perhaps also lawyers, as the cache eviction algos are patented.



