I personally prefer the abstract to jumping straight into a full paper, especially since it's quite rich (not one of those two-line entries like some arXiv paper abstracts). After reading the abstract I did end up opening the PDF... but I'm hesitant to pay the PDF tax early. Is this one of those "original source" type decisions?
Yes. I hear you about the downside, but the downside of the more superficial-accessible 'home page' is that people will not read any further, and instead simply respond generically.
The current URL just redirects back to that "superficial-accessible 'home page'" anyway (probably as a substitute for 404 handling, I'd guess); if the intent is to link directly to the paper/PDF, you probably want https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf
But I agree with the other person; if people really won't read any further from the homepage, then I highly doubt they'd read any further than the headline and maybe abstract of the original paper anyway, so there ain't really much upside to linking directly to the paper. Meanwhile there's quite a bit of downside for anyone who might feel inclined to watch the video instead (which essentially covers the same information as the paper, just at a higher level / without the same level of detail) or review the benchmark code - neither of which is accessible from the paper.
IMO you guys (as well as whoever did that) are underestimating the difference in how the two different kinds of submission affect resulting discussion. I did offer to change the top URL to point to the video, if they felt that was more important, but never heard back.
How many hours had it been between them making that change and you updating the URL again, though? If linking to just the PDF v. a homepage with the PDF + presentation + code would affect the resulting discussion, it's probably fair to say that the resulting discussion has already been affected (this particular conversation notwithstanding), no?
Even that aside, if the authors care so much about what people see first that they actively set a particular URL to redirect to their preference (EDIT: and have explicitly stated that preference in these comments, assuming "apavlo" is Andy Pavlo: https://news.ycombinator.com/item?id=29939332), should that preference not be respected?
The pragmatic consideration that usually influences the decision to use mmap() is the large discontinuity in skill and expertise required to replace it. Writing your own alternative to mmap() can be significantly superior in terms of performance and functionality, and often lends itself to a cleaner database architecture. However, this presumes a sufficiently sophisticated design for an mmap() replacement. The learning curve is steep and the critical nuances of sophisticated and practical designs are poorly explored in readily available literature, providing little in the way of "how-to" guides that you can lean on.
As a consequence, early attempts to replace mmap() are often quite poor. You don't know what you don't know, and details of the implementation that are often glossed over turn out to be critical in practice. For example, most people eventually figure out that LRU cache replacement is a bad idea, but many of the academic alternatives cause CPU cache thrashing in real systems, replacing one problem with another. There are clever and non-obvious design elements that can greatly mitigate this but they are treated as implementation details in most discussions of cache replacement and largely not discoverable if you are writing one for the first time.
While mmap() is a mediocre facility for a database, I think we also have to be cognizant that replacing it competently is not a trivial ask for most software engineers. If their learning curve is anything like mine, they'll go from mmap(), to designing obvious alternatives with many poorly handled edge cases, to eventually figuring out how to design non-obvious alternatives that can smoothly handle very diverse workloads. That period of "poor alternatives" in the middle doesn't produce great databases, but it almost feels necessary to properly grok the design problem. Most people would rather spend their time working on other parts of a database.
The original version of MongoDB used mmap, and I worked at a company that had a ton of issues with cache warmup and the cache getting trashed by competing processes. Granted this was a long time ago, but the main issue was the operating system's willingness to reallocate large swaths of memory from the address space to whatever process was asking for memory right now.
Once the working set got trashed, performance would go through the floor, and our app would slow to a crawl while the cache went through the warmup cycle.
Long story short, with that model, Mongo couldn't "own" the memory it was using, and this led to chronic problems. WiredTiger fixed this completely, but I still think this is a cautionary tale for anyone considering building a DB without a dedicated memory manager.
The original sales pitch I heard for slab allocators was: use the standard libraries for general workloads, but if you know your data better than the stdlib does, you might be able to do better.
mmap access patterns seem like something where you can do better. Especially in the age of io_uring, when an n+1 pointer chasing situation doesn't particularly care what order the results are processed as long as the last one shows up in a reasonable amount of time.
Perhaps I misread your first sentence but was MongoDB related to your cache warming issue? Or were these two distinct issues related to mmap-based data stores?
I have written a couple of mmap() based time series databases. In my case, these were databases for holding video. For my uses, mmap() has been great. I strongly agree with your comment. Maybe mmap() isn't the greatest, but it has worked for me.
When you say “replacing mmap()”, could you elaborate a bit on it? The way you write it sounds like you’re describing a reimplementation of mmap() with the same API, while I believe the actual goal would be to completely rewrite the persistence and caching layer to be like a “real” database.
The implementation is essentially a complete replacement for the kernel page cache and I/O scheduler, much of the behavior of which is hidden behind the mmap() functions. It is never a drop-in replacement and you wouldn't want it to be but it is functionally quite similar.
For example, while the storage will usually have a linear address space, the "pointer" to that address space won't be a literal pointer even though it may behave much like one. There may be stricter invariants around write back and page fault behavior, and madvise()-like calls have deterministic effects. You often have cheap visibility into details of page/buffer state that you don't get with mmap() that can be used in program logic. And so on. Different but similar.
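To make that concrete, here's a rough sketch of what such a non-literal "pointer" can look like in C. All the names are hypothetical (not from any particular engine); the point is just that page residency, dirtiness, and write-back become explicit program state instead of being hidden behind a raw address:

    /* Hypothetical buffer-pool API sketch -- illustrative only, not a real engine's API. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t page_id_t;

    typedef struct {
        page_id_t id;      /* logical page in the linear storage address space */
        void     *frame;   /* memory frame holding the page while pinned */
        bool      dirty;   /* page state is visible to program logic, unlike with mmap() */
    } page_handle_t;

    /* Pin a page: guaranteed resident until unpinned (no surprise page faults). */
    page_handle_t *pool_pin(page_id_t id);

    /* Unpin; the pool's replacement policy may now evict the frame. */
    void pool_unpin(page_handle_t *h, bool mark_dirty);

    /* Deterministic counterparts to madvise()-style hints. */
    void pool_prefetch(page_id_t id);
    void pool_flush(page_id_t id);   /* write-back happens when *you* say so */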
You have a file and a bunch of memory and you need to make sure data is being moved from file to memory when needed and from memory to file when needed.
mmap() is one algorithm to do it, and the idea is that it is not necessarily the best one.
Knowing more about your data and application and needs should theoretically enable you to design an algorithm that will be more efficient at moving data back and forth.
Those could already use available high-level DBs or libraries, rather than building their own.
I guess if somebody decides to build a new market-grade database system from scratch, they should hire experienced IO specialists and perhaps also lawyers, as some cache eviction algorithms are patented.
Interesting parallels in this work to Tanenbaum's "RPC Considered Harmful"†; in both cases, you've got an abstraction that papers over a huge amount of complexity, and it ends up burning you because a lot of that complexity turns out to be pretty important and the abstraction has cost you control over it.
Whenever you need better performance and reliability, identify an abstraction beloved of CS professors, and bypass it.
When I last checked, libtorrent was utterly failing to use O_DIRECT semantics. I started making a patch, but there are several places that do file ops, and the main one was more complicated than I could afford to dive into at the time.
So... don't bypass abstractions, unless you actually have time to do a better job, no?
We have abstractions for a reason. We have lower-level primitives for a reason. Understanding the differences, reasoning about all trade-off angles, and making the right choice in each project is a majority of the software engineering job.
> So... don't bypass abstractions, unless you actually have time to do a better job, no?
And unless there's a clear benefit to it. If you haven't identified the abstraction as a significant bottleneck, then is it really worthwhile to go through the trouble of bypassing it?
The problem is that people like to carve out territories in their data architecture before they have become subject matter experts. Once you split two things it's so difficult to add certain kinds of features that most people just give up and deal with higher fanout.
What you often get is the whole being less than the sum of its parts, and people trying to offset that by building greater 'parts' through coherence and conceptual integrity in isolation. There is such a thing as 'coherent but wrong'.
You don't want the OS to take care of reading from disk and page caching/eviction. You want the DB itself to have explicit control over that, because the DB has information on access patterns and table format that the OS is not aware of. It is better equipped than the OS to anticipate what portions of tables/indices need to be cached in memory. It is better equipped to calculate when/where/what/how much to prefetch from disk. It is better equipped to determine when to buffer writes and when to flush to disk.
Sure, it might be more work than using mmap. But it's also more correct, it forces you to handle edge cases, and it's much more amenable to platform-specific improvements a la kqueue/io_uring.
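As a rough illustration of the "explicit control" route (assuming Linux; the file name and page number are made up), a buffer pool typically bypasses the kernel page cache with O_DIRECT and reads pages into its own aligned frames, then decides for itself what to cache, prefetch, and evict:

    /* Minimal sketch: read one database page into a user-managed, aligned
       buffer, bypassing the kernel page cache. Assumes Linux + O_DIRECT. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    int main(void) {
        int fd = open("table.db", O_RDONLY | O_DIRECT);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        void *frame;
        /* O_DIRECT requires alignment (typically to the logical block size). */
        if (posix_memalign(&frame, PAGE_SIZE, PAGE_SIZE) != 0) return 1;

        off_t page_no = 42;                               /* arbitrary page */
        ssize_t n = pread(fd, frame, PAGE_SIZE, page_no * PAGE_SIZE);
        if (n != PAGE_SIZE) { perror("pread"); return 1; }

        /* The DB now decides itself when to cache, prefetch, or evict this frame. */
        free(frame);
        close(fd);
        return 0;
    }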
OTOH, if you care about that last 5 percent or so of performance, there is the complexity that what the OS has optimized for might differ between OS's (e.g., macOS, Linux, FreeBSD, etc.) and indeed might change between different versions of Linux, or even, in the case of buffered writeback, between different filesystems on the same version of Linux. This is probably historically one of the most important reasons why enterprise databases like Oracle DB, DB2, etc., have used direct I/O, and not buffered I/O or mmap.
Speaking as an OS developer, we're not going to try to optimize buffered I/O for a particular database. We'll be using benchmarks like compilebench and postmark to optimize our I/O, and if your write patterns, or readahead patterns, or caching requirements don't match those workloads, well... sucks to be you.
I'll also point out that those big companies that actually pay the salaries of us file system developers (e.g., Oracle, Google, etc.) for the most part use Direct I/O for our performance-critical workloads. If database companies that want to use mmap want to hire file system developers and contribute benchmarks and performance patches for ext4, xfs, etc., then, speaking as the ext4 maintainer, I'll welcome that, and we do have a weekly video conference where I'd love to have your engineers join to discuss your contributions. :-)
The mongodb developers once thought as you did. They were wrong, although it took a fair while for them to realise this. Yes it's complex. Extremely complex, and as another poster noted, the learning curve is horrible and documentation is extremely limited. Unfortunately there's no real substitute.
The mmap/madvise approach works well for things like varnish cache, where you have a flat collection of similar and largely unrelated objects. It does not work well for databases where you have many different types of data, some of which are interrelated, and all want to be handled differently. If you can meet the performance needs for your product by doing what you're doing then great - that's a fantastic complexity saving for your business. But the claim that "you can design your system so the access pattern that the OS is optimized for matches your needs" is unfortunately not true. It might be good enough for what you need, but it's not optimal. That's why there are so many lines of code in other DB engines doing this the hard way.
The MongoDB developers were morons. They used mmap poorly and gained none of its potential advantages. Their incompetence and failures are not an indictment against using mmap.
For as long as computers and databases have existed, there has been a war between DB designers and OS designers, with DB designers always claiming they have better knowledge of workloads than OS designers. That can only ever possibly be true when the DB is the only process running on the machine. Whenever anything else is also running on the machine that claim can not possibly be true.
Reality today is that nobody runs on dedicated machines. Everyone runs on "the cloud" where all hardware is shared with an unknown and arbitrary number of other users.
The counterargument to this is that the kernel can make decisions based on nonlocal information about the system.
If your database server is the only process in the system that is using significant memory, then sure, you might as well manage it yourself. But if there are multiple processes competing for memory, the kernel is better equipped to decide which processes' pages should be paged out or kept in memory.
Generally for perf critical use cases you dedicate the machine to the database. This simplifies many things (avoiding having to reason about sharing, etc etc).
This makes me wonder whether there would be value in an OS that is also a DBMS (or vice versa). In other words, if the DBMS has total control over the hardware, perhaps performance can be maximized without too much additional complexity.
That was back when hardware was changing to a significant degree, though. Nowadays, there ain't really much that's new about hardware today v. hardware from 10 or 20 years ago - hence operating systems / filesystems being able to remain mostly stable instead of suffering from the exact same problem.
The madvise() functions and similar are a blunt and imprecise instrument. The kernel is free to ignore them, and frequently does in practice. It also does not prevent the kernel from proactively doing things you don't want it to do with your buffer pool at the worst possible time.
A user space buffer pool gives you precise and deterministic control of many of these behaviors.
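For illustration (Linux; the region and the policy choices here are made up), the hints being talked about look roughly like this, and nothing obliges the kernel to honor the advisory ones:

    /* Sketch: advisory hints on an mmap()ed region. MADV_SEQUENTIAL and
       MADV_WILLNEED are only hints; the kernel may ignore them. */
    #include <stddef.h>
    #include <sys/mman.h>

    void hint_region(void *addr, size_t len) {
        madvise(addr, len, MADV_SEQUENTIAL);  /* "I'll scan this linearly" */
        madvise(addr, len, MADV_WILLNEED);    /* "please read it ahead"   */
        /* MADV_DONTNEED, by contrast, is not advisory on Linux: it drops the
           pages immediately, whether or not that suits your buffer pool. */
    }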
"If you aren’t using mmap, on the other hand, you still need to handle of all those issues"
Which seems like a reasonable statement. Is it less work to make your own top-to-bottom buffer pool, and would that necessarily avoid similar issues? Or is it less work to use mmap(), but address the issues?
QuestDB's author here. I do share Ayende's sentiment. There are things that the OP paper doesn't mention, which can help mitigate some of the disadvantages:
- single-threaded calls to 'fallocate' will help avoid sparse files and SIGBUS during memory writes (see the sketch at the end of this comment)
- over-allocating, caching memory addresses and minimizing OS calls
- transactional safety can be implemented via a shared-memory model
- hugetlb can minimize TLB shootdowns
I personally do not have any regrets using mmap because of all the benefits it provides
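For the first point, a minimal sketch (Linux; the file name and sizes are assumptions) of pre-allocating the backing blocks before mapping for write, so a later store into the mapping can't land on an unallocated block and raise SIGBUS:

    /* Sketch: reserve real blocks with posix_fallocate() before mmap()ing for
       write, so stores into the mapping don't SIGBUS on a sparse region. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAP_BYTES (64UL * 1024 * 1024)   /* hypothetical segment size */

    int main(void) {
        int fd = open("column.d", O_RDWR | O_CREAT, 0644);  /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        /* Allocate the blocks up front (single-threaded, as suggested above). */
        int err = posix_fallocate(fd, 0, MAP_BYTES);
        if (err != 0) { fprintf(stderr, "fallocate failed: %d\n", err); return 1; }

        char *p = mmap(NULL, MAP_BYTES, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[MAP_BYTES - 1] = 1;   /* safe: the backing block is guaranteed to exist */

        munmap(p, MAP_BYTES);
        close(fd);
        return 0;
    }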
I suppose. Some problems with mmap() are a bit hard to fix from user land though. You will hit contention on locks inside the kernel (mmap_sem) if the database does concurrent high-throughput mmap()/munmap(). I don't follow Linux kernel development closely enough to know if this has been improved recently, but it was easy to reproduce 4-5 years ago.
That makes sense. I wasn't going right to the conclusion that working around mmap() issues was easier, but it didn't seem to be explored much. Is the contention around having one file mmap()ed, or is it reduced if you use more files?
When I worked on/with BerkeleyDB in the late 90s we came to the conclusion that the various OS mmap() implementations had been tweaked/fixed to the point where they worked for the popular high profile applications (in those days: Oracle). So it can appear like everything is fine, but that probably means your code behaves the same way as <popular database du jour>.
Um... Oracle (and other enterprise databases like DB2) don't use mmap. They use Direct I/O. Oracle does have anonymous (non-file-backed) memory which is mmap'ed and shared across various Oracle processes, called the Shared Global Area (SGA), but it's not used for I/O.
Some issues with mmap() can be avoided entirely if you have your own buffer pool. Others are easier to handle because they are made explicit and more buffer state is exposed to the program logic. That's the positive side.
The downside is that writing an excellent buffer pool is not trivial, especially if you haven't done it before. There are many cross-cutting design concerns that have to be accounted for. In my experience, an excellent C++ implementation tends to be on the order of 2,000 lines of code -- someone has to write that. It also isn't simple code; the logic is relatively dense and subtle.
Thank you for these counter-arguments. It's good to have them to help make up your own mind, especially when recognized experts adopt a mocking "don't you dare think otherwise" tone.
Choosing mmap() gets you something that works sooner than later.
But then you're left with a pile of blocking-style synchronous code (likely exploiting problematic assumptions) to rewrite when you realize you want something that doesn't just work, but works well.
And? So you're going to take a syscall (mincore()?) hit before every file-backed mapping access to test if it'd incur a page fault, and try switch to another coroutine if it would?
Syscalls aren't free, especially today, and mmap() is already bringing significant overhead to the party.
If you've brought coroutines into the picture, you might as well schedule them using async IO completions and kick mmap() to the curb.
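To make the overhead concrete, the residency probe being described would look roughly like this (Linux; purely illustrative), and it's an extra syscall on every access path before you even touch the data:

    /* Sketch: ask the kernel whether a mapped page is resident before touching
       it, so a coroutine scheduler could yield instead of blocking on a fault. */
    #define _DEFAULT_SOURCE
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    bool page_is_resident(const void *addr) {
        long psize = sysconf(_SC_PAGESIZE);
        uintptr_t base = (uintptr_t)addr & ~((uintptr_t)psize - 1);
        unsigned char vec;
        /* One syscall per probe -- this is the per-access tax being described,
           and the answer can already be stale by the time you do the load. */
        if (mincore((void *)base, (size_t)psize, &vec) != 0)
            return false;
        return vec & 1;
    }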
One possible advantage of using mmap over a buffer pool can be programmer ergonomics.
Reading data into a buffer pool in process RAM takes time to warm up, and the pool can only be accessed by a single process. In contrast, for an mmap-backed data structure, assuming that files are static once written (which can be the case for a multi-version concurrency control (MVCC) architecture), you can open an mmap read-only connection from any process and, so long as the data is already in the OS cache, you get instant fast reads. This makes managing database connections much easier, since connections are cheap and the programmer can just open as many as they want, whenever and wherever they want.
It is true that the cache eviction strategy used by the OS is likely to be suboptimal. So if you're in a position to only run a single database process, you might decide to make different tradeoffs.
This is true, but in the case where files are read only, just reading directly from the files with fread()/read()/etc works pretty well. You do have to pay the cost of a system call and a copy from the OS buffer cache into your user-space buffer, but OTOH when the page isn't in the buffer cache, the cost of reading the required data from storage is more predictable than the cost of faulting in all the 4kb pages you're reading.
Makes me wonder if there is an alternative universe in which there is a syscall with semantics similar to mmap that avoids these pitfalls. It's not like mmap's semantics are the only semantics that we could have for memory-mapped IO.
This would be exactly the kind of innovation we need in computer science. Instead we often get stuck in local minima (in this case a 40-year-old POSIX interface) without realizing how much pain this causes.
I worked at a company that developed its own proprietary database for a financial application. The entire database, several gigabytes, not large by today's standards, was mmap'd and read at startup to warm the page cache. We also built in-memory "indexes" at startup.
This was back in the early 2000's, when having 4 gigabytes of RAM would be considered large. The "database server" was single threaded, and all changes were logged to a WAL-ish file before updating the mmap'd database. It was fun stuff to work on. It worked well, but it wasn't a general purpose DB.
How do you implement lockless atomic updates for multiple writers across multiple threads & processes without mmap?
With mmap it is straight forward for processes to open persistent arrays of atomics as a file, and use compare and exchange operations to prevent data races when multiple threads or processes update the same page without any file locks, advisory locks, or mutexes.
With manual read() and write() calls, the data may be overwritten by another writer before the update is committed.
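A minimal sketch of the pattern being described (C11 atomics over a MAP_SHARED file mapping; the file name and layout are made up): any process that maps the same file can compare-exchange on the same word without taking any lock.

    /* Sketch: lock-free cross-process update via a shared file mapping.
       Any process mapping "counters.dat" sees the same atomic word. */
    #include <fcntl.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("counters.dat", O_RDWR | O_CREAT, 0644);  /* hypothetical */
        if (fd < 0 || ftruncate(fd, 4096) != 0) return 1;

        _Atomic uint64_t *slot = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if (slot == MAP_FAILED) return 1;

        /* Claim-style update: retry the compare-exchange until our increment
           lands, no matter how many other processes race with us. */
        uint64_t old = atomic_load(slot);
        while (!atomic_compare_exchange_weak(slot, &old, old + 1))
            ;   /* 'old' is refreshed on failure; just retry */

        printf("value is now %llu\n", (unsigned long long)atomic_load(slot));
        munmap((void *)slot, 4096);
        close(fd);
        return 0;
    }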
Normally, your IPC structures where you put lock-free data structures are mmaped in tmpfs, which is backed by RAM only, not files. A lot of the problems with mmap-ed files only show up when the file is larger than RAM (which is the case with databases). Files for IPC in tmpfs are usually small and don't have that problem.
No. _I_ want to know what we're talking about, as my original question clearly indicates.
You can do "lock-free memory based interprocess communication" with memory (obviously). There is no need to back this memory with files, certainly not files on a hard drive that you would otherwise access using read() and write(). Hence my original question.
First, I didn't assume it was a requirement to have two processes read() and write() _directly_ to the same memory (I suppose you meant "file region" here). And idk, it might not be a good idea to require that.
Also, you can use normal (non-file-backed) memory to do the necessary synchronization (lock-free or not). I'm still not seeing why the memory should be backed by a file, that's why I was genuinely asking. One reason why it could be practical that I can now see could be for an embedded database like sqlite, but again I'm not sure it would be a good idea. While it would allow for pretty much setup-less synchronization of otherwise uncoordinated processes, it's a fringe application that might be better implemented with one big flock(). And one reason why it could be not a good idea is that it might couple the file format to a particular CPU architecture.
Another big issue, I guess, is that the atomics actually do have an effect on the underlying file whenever the pages are flushed. What if the computer shuts down unexpectedly? The synchronization affairs aren't cleaned up, yet the original processes are gone.
You weren't asking, you were saying it wasn't necessary, which you did in the sentence right before this one:
Also, you can use normal (non-file-backed) memory to do the necessary synchronization (lock-free or not).
Again, this is just a repeated claim, it isn't an explanation. How do you have two processes writing to the same place in memory without memory mapping a file?
have two processes read() and write() _directly_ to the same memory
I didn't say read() and write(); I said read and write, as in reading and writing with memory addresses. Again, this is all about lock-free interprocess communication. You can't write outside your own memory from a process with normal permissions, so how do you share memory with another process?
You memory map the same file. This isn't about the file being written to some sort of persistent storage, that happens on the OS level and doesn't interfere with two running processes communicating with each other. The file can be deleted after the last process closes it. It is just a way for the two processes to have memory mapped into their virtual memory space that overlaps with each other.
You need to deal with memory directly so you can use atomics. You need to use atomics so you can avoid locks.
I thought you might have had some other technique that I'm not aware of but it seems now you were making claims without much behind them, which is disappointing.
These are still memory mapping files using file paths and returning file descriptors as far as I know, which makes sense because you have to have something coordinated between the two processes.
> These are still memory mapping files using file paths ...
No, they're not. The entire purpose of MAP_ANONYMOUS is to avoid using files.
Sources:
1. The Linux Kernel source code [1], where it comes with the code comment: "don't use a file".
2. The glibc source code [2], where it comes with the same code comment: "Don't use a file".
3. The Linux man-pages project documentation of mmap [3], where it is documented thus: "The mapping is not backed by any file; its contents are initialized to zero. The fd argument is ignored"
Similarly for SHM, but if you still don't get the point about MAP_ANONYMOUS, I doubt you'll get it for SHM either.
> ... and returning file descriptors
A socket is a file descriptor. An epoll handle is a file descriptor. On modern Linux kernels, a pid handle is a file descriptor. None of them are backed by "files".
> ... because you have to have something coordinated between the two processes.
FDs are not the only things processes can share, even if you go back to the venerable, original Unices, so I don't see what you mean.
This just goes back to the same question - what do two processes use to map the same memory into their memory space if it isn't a path to a file?
I'm not saying there isn't anything, I'm just seeing an extreme avoidance of an actual answer. The other guy went down a rabbit hole of syncing that memory to storage, which has nothing to do with anything.
I'm starting to think you're even more confused than I had assumed. You were literally given a reasonable possible answer to your question multiple times (MAP_ANONYMOUS). And if there wasn't a big confusion you wouldn't be asking these questions in the first place because you could just make up your own answer.
I'm also left uncertain if you're assuming Linux and not talking about it. At least your objections to general statements are weirdly specific, while you never clarify the context (e.g. what OS you're talking about), and you seem to assume that there couldn't be other ways of achieving stuff. There seems to be a weird lack of understanding of the basics in your comments.
At the core, everything you need to share memory is that the participating processes agree about the (physical) address range of that memory (e.g. a 64-bit starting address and a 64-bit size). You could literally hardcode a physical address range, map this range to arbitrary (and possibly different) virtual address ranges in each of the processes, and start communicating through that shared memory. Note that the mappings are stored in the RAM and CPU, it has nothing at all to do with any files or filepaths.
And this whole discussion is completely pointless anyway because it started with YOU misunderstanding what I meant by "file-backed memory", which is not my fault at all. The term is completely unambiguous: it means (as opposed to POSIX SHM / MAP_ANONYMOUS / whatever) page cache memory that gets synced to an underlying file on a filesystem.
Please stop questioning and start experimenting and understanding what we're saying. We know what we're talking about. You don't.
"MAP_ANONYMOUS|MAP_SHARED mapped memory can only be accessed by the process which does that mmap() call or its child processes. There is no way for another process to map the same memory because that memory can not be referred to from elsewhere since it is anonymous."
misunderstanding what I meant by "file-backed memory"
No, it started by talking about using atomics for lock free interprocess communication, something MAP_ANONYMOUS can't do.
You hallucinated writing to storage as being part of this, didn't explain yourself, and are getting upset about it. Atomic instructions that manipulate memory are orthogonal to what the OS does in the background. No one would think an operation on the order of nanoseconds has anything to do with writing to permanent storage.
clarify the context (e.g. what OS you're talking about)
This thread is about mmap - it says it in the title.
it has nothing at all to do with any files or filepaths.
Two processes need some way to map the same memory and they do it through file paths.
> This thread is about mmap - it says it in the title.
I was asking what YOU are talking about. And also, this thread is actually about the approach of memory-mapped file I/O, not about POSIX mmap() specifically.
That's why I was (clearly) making statements that are not tied to any particular OS or platform, from the beginning.
If your boss said "we need these two programs to have lock free IPC through memory" and you said "use MAP_ANONYMOUS" they would say "that is local to the process tree and won't work".
You can try to ignore the context of this thread, but if someone wants IPC, this doesn't work.
> But then that isn't interproccess communication.
It is. It may not be _generic_ IPC, but it is IPC all the same. E.g., this is how postgres does IPC across its processes.
> that is local to the process tree and won't work
Isn't that what SHM is for? But, oh I see, you're willfully ignoring the fact that SHM keys _are not file paths_. So, yeah, I guess in _your_ world, non-file-backed IPC can't work.
> If your boss said ...
Sucks to be your boss, since _you_ don't get the fact that SHM keys and the filesystem are entirely separate namespaces.
> You weren't asking, you were saying it wasn't necessary, which you did in the sentence right before this one:
quoting my OP: " Why do you need lockless atomic updates to a file-backed memory area? Genuinely curious. " . Dude.
> it seems now you were making claims without much behind them, which is disappointing.
Well thank you very much.
I get the feeling we might just be talking about the same thing. Or we might be not, I'm not sure.
> How do you have two processes writing to the same place in memory without memory mapping a file?
> You can't write outside your own memory from a process with normal permissions so how do you share memory with another process?
For example on Linux, use shm_open() + mmap(). This is just an example, and granted it uses a file-like API (shared memory objects show up on /dev/shm on a typical Linux) but it is not "file-backed" (I meant disk backed and this might be the misunderstanding) and in particular it's certainly not mapping the database file. It's just one way on one OS to map the same physical memory into different processes' address spaces.
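For concreteness, a sketch of that approach (Linux/POSIX; the object name is made up). Each participating process runs the same few lines and ends up with the same physical pages mapped, with no database file involved:

    /* Sketch: share memory between unrelated processes via a POSIX shared
       memory object (shows up under /dev/shm on Linux, but is RAM-backed). */
    #include <fcntl.h>
    #include <stdatomic.h>
    #include <sys/mman.h>
    #include <unistd.h>

    _Atomic int *open_shared_flag(void) {
        /* Both processes use the same (hypothetical) object name. */
        int fd = shm_open("/example_ipc", O_CREAT | O_RDWR, 0600);
        if (fd < 0) return NULL;
        if (ftruncate(fd, sizeof(_Atomic int)) != 0) return NULL;

        void *p = mmap(NULL, sizeof(_Atomic int), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping stays valid after close */
        return p == MAP_FAILED ? NULL : p;
    }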
If this example approach is "file-backed" to you, then so be it but I think you have willfully misread my comments up to here.
Homework: go back through my comments and identify all the places where I was VERY CLEARLY pointing out that my statement is that no disk-backed file is needed, or where you could reasonably infer this from my use of the term "file-backed", as well as from the general context of the discussion.
> shm_open("/TESTOBJECT"
>>That's a file path
Pedantically, no. It's a name (https://man7.org/linux/man-pages/man3/shm_open.3.html) that identifies a memory object that is only coincidentally also mapped to the file path "/dev/shm/TESTOBJECT" on a typical linux. shm_open() returns an "FD", though.
On Linux, as a sibling poster noted, you could also use mmap(.. MAP_SHARED | MAP_ANONYMOUS, /*fd*/ -1 ...) , which to my knowledge is entirely "file-free" by any meaning of the term "file". But then again, in my understanding this would only work with child processes because that mapping has to be inherited.
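A sketch of that variant (Linux; illustrative only): the anonymous shared mapping is set up before fork() and inherited, so there is genuinely no file or path anywhere, but it only reaches related processes.

    /* Sketch: anonymous shared mapping inherited across fork(), no file at all.
       Parent and child both see the same atomic counter. */
    #define _DEFAULT_SOURCE
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        _Atomic int *counter = mmap(NULL, sizeof(_Atomic int),
                                    PROT_READ | PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (counter == MAP_FAILED) return 1;

        if (fork() == 0) {                    /* child */
            atomic_fetch_add(counter, 1);
            _exit(0);
        }
        wait(NULL);                           /* parent */
        atomic_fetch_add(counter, 1);
        printf("counter = %d\n", atomic_load(counter));   /* prints 2 */
        return 0;
    }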
On other OSes, there may be completely different APIs to map shared memory that don't involve anything "file" like, either. Quite honestly I can't point you to any because I do only Linux and Windows, but let's just end the discussion here and let's agree that memory != file. I'm angry at myself for wasting another evening fighting a pointless discussion with somebody who would rather argue than try to get my point.
You conflated files with disks on your own. No one did that for you.
rather argue than try to get my point.
I still don't know what your point is. You have to have something that coordinates between two processes for shared memory interprocess communication and that ends up being file paths for the OS. You asked questions, they were answered and you could have learned something.
The whole point was actually that you can map the same memory into two different processes and use atomics, which is an incredible technique. For some reason you wanted to ignore that and make claims without explanation.
If you didn't want to waste time, you would have explained what you meant or asked questions.
> If you didn't want to waste time, you would have explained what you meant or asked questions.
You clearly haven't done your homework, because I did.
> You conflated files with disks on your own. No one did that for you.
I did not really conflate this. It is just conventional but imprecise terminology, and everyone who gets into such a discussion (especially when starting personal attacks) is expected to know to be careful when one hears "file" that it could mean "filepath", "file descriptor", or "file data" - especially "persistent file data" / "file storage", and that it could or could not mean something specific Unix-y or not Unix-y, or just some unspecific "data object". My usage of the term "file-backed" is definitely clear enough. More so given all the other explanations I made. Even more in the context of mmapping database files.
How about this: You yourself are the one who wasn't clear (or just wrong, not really understanding virtual memory), and I was the one clarifying myself multiple times, and I was the one just trying to make a simple point that could be easily understood by not being stubborn.
> The whole point was actually that you can map the same memory into two different processes and use atomics, which is an incredible technique. For some reason you wanted to ignore that and make claims without explanation.
I never ignored that but said from the beginning that you should share memory, but not file-backed memory. It's standard to share memory between processes and threads (especially threads), not an "incredible technique". It's an essential part of virtual memory management.
Go right back here to my first reply to your first reply, https://news.ycombinator.com/item?id=29943137 . Which has it all. "Because it allows you to do lock free memory based interprocess communication, which can be extremely fast." > " There is no need for file-backed memory to do that. ". Also go read my OP's sibling comment. Go read TFA, or just the title of this discussion. How can you not stop pretending you were just caught in an argument that you could not get out of without acknowledging you were wrong?
My very next comment: https://news.ycombinator.com/item?id=29947339 , "You can do "lock-free memory based interprocess communication" with memory (obviously). There is no need to back this memory with files". That comment also explains the problems of using a persistent file as backing. WHAT THE HELL STOP PRETENDING I WASN'T CLEAR THAT THIS IS ABOUT FILES ON DISK.
The next comment: "you can use normal (non-file-backed) memory to do the necessary synchronization (lock-free or not). I'm still not seeing why the memory should be backed by a file"
Then you wouldn't explain it and eventually admit that you do need to have a file path to give to another process, but only after I asked you to show what you meant multiple times.
And there isn't. It seems you just don't really understand virtual memory, and don't want to acknowledge what everyone else understands by "file-backed memory". Given that, I find it courageous how stubborn you are, as well as the resort to personal attacks.
> Then you wouldn't explain it and eventually admit that you do need to have a file path
Need to have a file path IN WHICH ENVIRONMENT, IN WHICH CONTEXT??? Could YOU please clarify. We can easily make a simple OS which doesn't have "files" but does have processes that can share memory using virtual memory technology.
Shared memory IPC is fundamentally not about files, and you were even shown a way to setup shared memory mappings between Linux processes using normal userland API entirely without the use of files or file paths - with the restriction that the mappings have to be inherited (fork()).
How someone, even with no real understanding of the topic, could not at the latest at https://news.ycombinator.com/item?id=29947339 acknowledge that I was being perfectly clear that I was talking about persistent files (I literally said on a hard drive), is beyond me. I should have stopped this discussion at that point.
Files being persistent on storage has nothing to do with communicating through shared memory. It isn't necessary and it doesn't interfere if it's there. It is completely orthogonal, I don't know why it would ever be a part of the conversation when talking about direct reading and writing to the same memory.
> Files being persistent on storage has nothing to do with communicating through shared memory.
Files (whether persistent or not) have not really anything to do with communication through shared memory. In the implementation of an API like shm_open(), the VFS (virtual filesystem) is simply the address space and lookup mechanism that an operating system like Linux happens to use in order to find the memory that should be shared.
> It isn't necessary and it doesn't interfere if it's there.
Sure it does interfere. By backing memory needlessly with a persistent file, you're causing disk I/O from the loading and flushing (that can't really be controlled) and potentially bad performance.
Also, as explained, if you use a persistent file to track the synchronization state, the synchronization state won't be reset when the communicating processes die unexpectedly, and this might be problematic.
system like Linux happens to use in order to find the memory that should be shared.
Right. Is there some other mechanism to coordinate mapping the same memory between processes? That's all I ever asked.
Sure it does interfere. By backing memory needlessly with a persistent file, you're causing disk I/O from the loading and flushing (that can't really be controlled) and potentially bad performance.
That is orthogonal, since once you have the memory mapped into both processes you can use atomics for lock free IPC. That's the whole thing. It doesn't matter what the OS does or doesn't do in the background, atomically reading and writing to memory is unaffected.
I have a great deal of experience in running very large memory-mapped databases using LMDB.
The default Linux settings dealing with memory mapped files are pretty horrible. The observed poor performance is directly related to not configuring several very important kernel parameters.
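(My guess at which settings are meant, since they aren't named here: the vm.dirty_* writeback knobs. A minimal sketch of setting them programmatically; the specific values are purely illustrative, not recommendations, and you could equally just use sysctl.)

    /* Sketch (assumption: the knobs in question are the vm.dirty_* writeback
       settings). Equivalent to `sysctl vm.dirty_background_ratio=1` etc. */
    #include <stdio.h>

    static void set_knob(const char *path, const char *value) {
        FILE *f = fopen(path, "w");            /* needs root */
        if (!f) { perror(path); return; }
        fputs(value, f);
        fclose(f);
    }

    int main(void) {
        set_knob("/proc/sys/vm/dirty_background_ratio", "1");    /* start flushing early */
        set_knob("/proc/sys/vm/dirty_ratio", "5");               /* cap dirty memory     */
        set_knob("/proc/sys/vm/dirty_expire_centisecs", "500");  /* age out dirty pages  */
        return 0;
    }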
These settings control writing back modified pages. The experiments in the paper are read-only. With writes the situation is even worse than shown in the paper (though tuning these settings may help a bit).
Yup. Why use the operating system's async I/O system when you can simply burn a thread and do blocking I/O? </snark>
Been down that primrose path, have the road rash to prove it. mmap() is great until you realize that pretty much all you've avoided is some buffer management that you probably need to do anyway. The OS just doesn't have the information it needs to do a great (or even correct) job of caching database pages.
> Why use the operating system's async I/O system when you can simply burn a thread and do blocking I/O? </snark>
mmap isn't non-blocking; page faults are blocking, no different from a read or write to a (non-direct I/O) file using a syscall.
Until recently io_uring literally burned a thread (from a thread pool) for every read or write regular file operation, too. Though now it finally has hooks into the buffer cache so it can opportunistically perform the operation from the same thread that dequeued the command, pushing it to a worker thread if it would need to wait for a cache fault.[1]
[1] Technically the same behavior could be implemented in user space using userfaultfd, but the latency would likely be higher on faults.
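For reference, a minimal liburing sketch of the buffered-read path described above (assumes liburing is installed; the file name is made up). If the data is cached the kernel can complete the read inline; otherwise it punts to one of its internal workers rather than blocking our thread on a fault:

    /* Sketch: a single buffered read through io_uring (liburing API, Linux). */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    int main(void) {
        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

        int fd = open("table.db", O_RDONLY);          /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }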
A user process doesn't have the information it needs to do a good job of coordinating updates from multiple writers to database pages and indices. With MMAP, writers have access to shared atomics which they can update using compare-exchange operations to prevent data races which would be common when using read() and write() without locks.
There can be a data race any time a processor loads a value, modifies it, and writes it back. Without an atomic update operation like compare_exchange() generally you need to lock the database file against other processes and threads. The typical solution is to only have one process update the file, only have one thread perform the writes, and combine it with a TCP server.
Suppose you have a big data file and want to mark which pages are occupied and which are free. A writer reads a bit from an index page into a local variable to check whether a data page is occupied, flips that bit locally to claim the page if no other process has claimed it, and then writes the updated value back to shared memory to actually claim the data page for storing its value.
If each process read()s the index bits, both can see that the page 2 bit in the index is unset and try to claim it, then write() back the updated index value. The updates to the index will collide, both writers will think they claimed page 2 when only one should have, and one of the data values written to that page will get lost.
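A sketch of that bitmap claim done with a compare-exchange instead of read()/write() (the layout and names are hypothetical): the CAS fails for the loser, so only one writer ever claims page 2.

    /* Sketch: claim a free page by atomically setting its bit in an index word
       that lives in a MAP_SHARED mapping. Only one of the racing writers wins. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* 'index_word' points into the shared mapping of the index page. */
    bool claim_page(_Atomic uint64_t *index_word, unsigned page_bit) {
        uint64_t old = atomic_load(index_word);
        for (;;) {
            if (old & (1ULL << page_bit))
                return false;                     /* someone else already owns it */
            uint64_t desired = old | (1ULL << page_bit);
            if (atomic_compare_exchange_weak(index_word, &old, desired))
                return true;                      /* we claimed the page */
            /* CAS failed: 'old' now holds the current value; retry. */
        }
    }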
This was never up for debate and is more diversion.
Was someone "saying you have to use mmap or you get data races??"
No, no one was saying that. You need it to do lock free synchronization because you need to map the same memory into two different processes to use atomics.
Most of the times I used mmap I wasn't happy in the end.
I went through a phase when I thought it was fun to do extreme random access on image files, archives, and things like that. At some point I'd think "I want to do this for a file I fetch over the network", and that would need a rewrite.
Thank you for sharing your DB course videos on YouTube. I'm a CMU staff member (Open Learning Initiative) who would never be able to enroll on-site, given my likely lower priority for getting a seat, but watching your videos online has been fantastic.
https://www.youtube.com/watch?v=1BRGU_AS25c
and this code: https://github.com/viktorleis/mmapbench