Show HN: Comprehensive inter-process communication (IPC) toolkit in modern C++ (github.com/flow-ipc)
88 points by ygoldfeld on April 14, 2024 | 57 comments
If you work in C++, and you would like 2+ programs to share data structures (and/or native I/O handles a.k.a. FDs) among each other, there is a good chance Flow-IPC

- will make it considerably less annoying to code than typical approaches; and

- may massively reduce the latency involved.

Those sharing Cap'n Proto-encoded data may have particular interest. Cap'n Proto (https://capnproto.org) is fantastic at its core task - in-place serialization with zero-copy - and we wanted to make the IPC (inter-process communication) involving capnp-serialized messages be zero-copy, end-to-end.

That said, we paid equal attention to other varieties of payload; it's not limited to capnp-encoded messages. For example there is painless (<-- I hope!) zero-copy transmission of arbitrary combinations of STL-compliant native C++ data structures.

To help determine whether Flow-IPC is relevant to you we wrote an intro blog post. It works through an example, summarizes the available features, and has some performance results. https://www.linode.com/blog/open-source/flow-ipc-introductio...

Of course there's nothing wrong with going straight to the GitHub link and getting into the README and docs.

Currently Flow-IPC is for Linux. (macOS/ARM64 and Windows support could follow soon, depending on demand/contributions.)




I also went this route and came to the very same conclusions. Cap'n proto for fast reading, SHM for shared data, simple short messaging, just everything in C.

My only problem is macOS with its too-small default SHM limits; you need to increase them. Most solutions need a reboot, but a simple setter is enough. Like sudo sysctl -w kern.sysv.shmmax=16777216


Interesting! I'd best write this down. Current notes on macOS and Windows port work:

https://github.com/Flow-IPC/ipc/issues/101 (<= https://github.com/orgs/Flow-IPC/discussions/98)

For macOS/ARM64, it currently looks to me like the apparent lack of a /dev/shm equivalent (unless I messed up in searching for it) means the most significant amount of new work necessary to port it ... but you just mentioned a thing I did not know about. (SHM size/count limits definitely were a thing on Linux too, indeed.) TY


I wrote some shm API wrappers here: https://github.com/rurban/smart/blob/new/source/algos/includ...

Never use /dev/shm directly unless Linux only.


Oh… Sys V SHM… not POSIX. Old-school.


There's no reason to use (the very ancient) SHM API over mmap, not today.

You can literally do everything with mmap that you can do with shm, without hitting OS caps, with no performance penalty, and with code that's simpler.


Well, except the memory-sharing part. Linux added memfds, but macOS doesn't have that.

I imagine on macOS you'd have to use the Mach APIs if you're avoiding shm.
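
For reference, the Linux memfd route mentioned above looks roughly like this -- a minimal sketch, error handling elided (memfd_create needs Linux 3.17+ / glibc 2.27+):

    // Create an anonymous shared-memory file, size it, and map it.  The fd can
    // then be passed to another process over a Unix domain socket (SCM_RIGHTS),
    // which mmap()s it the same way -- no /dev/shm name, no shm_open().
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstring>

    int main() {
      constexpr size_t kSize = 1 << 20;  // 1 MiB region
      int fd = memfd_create("ipc-region", MFD_CLOEXEC);
      ftruncate(fd, kSize);              // set the region's size
      void* addr = mmap(nullptr, kSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      std::memcpy(addr, "hello", 6);     // visible to any process that maps this fd
      munmap(addr, kSize);
      close(fd);
    }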


Tbh on macOS you should probably use XPC / Mach, which will likely let you do more than a generic IPC library. Of course, caveat emptor: it's not portable.


I would really not trust XPC to be relevant when you care about raw performance and latency. The whole OS is slow as molasses.


Does the schema help a lot? For C++ you can get very fast without one, for example with IceOryx https://github.com/eclipse-iceoryx/iceoryx

In contrast to Cap'n Proto, you get compiler-optimized struct layout as a benefit of using raw structs. Benchmarks are here: https://iceoryx.io/v2.0.2/examples/iceperf/


I think that question (whether, and how much, a capnp schema-based struct helps versus a native struct) is really the general question of what kind of serialization is best for a particular use-case. I wouldn't want to litigate that here fully. Personally though I've found capnp-based IPC protocols to be neat and helpful, across versions and protocol changes (where e.g. there are well-defined rules of forward-compatibility; and Flow-IPC gives you niceties including request-response and message-type demultiplexing to a particular handler). [footnote 1 below]

BUT!!! Some algorithms don't require an "IPC protocol" per se, necessarily, but rather something more like 2+ applications collaborating on a data structure. In that case native structures are for sure superior, or at times even essentially required. (E.g., if you have some custom optimized hash-table -- you're probably not going to want to express it as a capnp structure.)

So, more to the point:

- Flow-IPC 100% supports transmitting/sharing (and constructing, and auto-destroying) native C++ structures. Compared to iceoryx, on this point, it appears to have some extra capabilities, namely full support for structures with pointers/references and/or STL-compliant containers. (This example https://iceoryx.io/latest/examples/complexdata/ and other pages say things like, "To implement zero-copy data transfer we use a shared memory approach. This requires that every data structure needs to be entirely contained in the shared memory and must not internally use pointers or references. The complete list of restrictions can be found...".) Flow-IPC, in this context, means no need to write custom containers sans heap-use, or eliminate pointers in an existing structure. [footnote 2 below]

- Indeed, the capnp framing (only if you choose to use the Flow-IPC capnp-protocol feature in question!) adds processing and thus some computational and RAM-use overhead. For many applications, the 10s of microseconds added there don't matter much -- as long as they are constant regardless of structure size, and as long as they are 10s of microseconds. So a 100usec (modulo processor model of course!) RTT (size-independent) is pretty good still. Of course I would never claim this overhead doesn't matter to anyone, and iceoryx's results here are straight-up admirable.

[footnote 1] The request/response/demultiplexing/etc. niceties added by Flow-IPC's capnp-protocol feature-in-question work well IMO, but one might prefer the sweet RPC-semantics + promise pipelining of capnp-RPC. Kenton V (capnp inventor/owner) and I have spoken recently about using Flow-IPC to zero-copy-ify capnp-RPC. I'm looking into it! (He suspects it is pretty simple/natural, given that we handle the capnp-serialization layer already, and capnp-RPC is built on that.) This wouldn't change Flow-IPC's existing features but rather exercise another way of using them. In a way Flow-IPC provides a simple-but-effective-out-of-the-box schema-based conversation protocol via capnp-serialization, and capnp-RPC would provide an alternate (to that out-of-the-box guy) conversation protocol option. I tried pretty hard to design Flow-IPC in a grounded and layered way, so such work would be natural as opposed to daunting.

[footnote 2] In fact the Flow-IPC capnp-based structured-channel feature (internally) itself uses Flow-IPC's own native-structure-transmission feature in its implementation (eat our own dog-food). Since a capnp serialization = sequence of buffers (a.k.a. segments), for us it is (internally) represented as essentially an STL list<vector<uint8_t>>. So we construct/build one of those in SHM (internally); then only a small SHM-handle is (internally) transmitted over the IPC-transport [footnote 3]; and the receiver then obtains the in-place list<vector<uint8_t>> (essentially) which is then treated as the capnp-encoding it really is. This would all happen (internally) when executing the quite-short example in the blog (https://www.linode.com/blog/open-source/flow-ipc-introductio...). As you can see there, to the Flow-IPC-using developer, it's just -- like -- "create a message with this schema here, call some mutators, send"; and conversely "receive a message expected to have that (same) schema, OK -- got it; call some accessors."

[footnote 3] IPC-transport = Unix domain socket or one of 2 MQ types -- you can choose via template arg (or add your own IPC-transport by implementing a certain pair of simple concepts).


Thank you very much for this excellent explanation! I am one of the fathers of IceOryx and its predecessor. We had to lift component-based embedded development to POSIX systems and are very latency and memory-bandwidth sensitive (driver assistance and automated driving on what most people would call small SoCs). There it is easier to enforce that the senders and receivers use the same struct.

What you did with the shm arena and sharing std containers is outright amazing and indeed relaxes the "self contained" constraint nicely.

On QNX (up to 7) we were bitten by each syscall going through procnto, that is why we have chosen lockfree over mq btw.

Being aware of the use case and choosing the right tradeoff is crucial, as you wrote.


Now I'm curious. It seems you are not the father I'm still drinking beer with. This means there is only one person left who fits this attribute :) ... we should meet for some beer with the other father ;)


Got me. Next time I'm in Berlin we'll do... ;) Good job with IceOryx2, guys!


We are waiting with some salt & vinegar crisps ;)


Nice, but please no T-Shirts on train platforms... =;-D


I'm one of the iceoryx maintainers. Great to see some new players in this field. Competition leads to innovation and maybe we can even collaborate in some areas :)

I did not yet look at the code but you made me curious with the raw pointers. Did you find a way to make this work without serialization or mapping the shm to the same address in all processes?

I will have a closer look at the jemalloc integration since we had something similar in mind with iceoryx2.


We are doing it with fancy-pointers (yes, that is the actual technical term in C++ land) and allocators. It’s open-source, so it’s not like there’s any hidden magic, of course: “Just” a matter of working through it.

Using manual mapping (same address values on both sides, as you mentioned) was one idea that a couple people preferred, but I was the one who was against it, and ultimately this was heeded. So that meant:

Raw pointer T* becomes Allocator<T>::pointer. So if user happens to enjoy using raw pointers directly in their structures, they do need to make that change. But, beats rewriting the whole thing… by a lot.

container<T> becomes container<T, Allocator<T>>, where `container` was your standard or standard-compliant (uses allocator properly) container of choice. So if user prefers sanity and thus uses containers (including custom ones they developed or third-party STL-compliant ones), they do need to use an allocator template argument in the declaration of the container-typed member.

But, that’s it - no other changes in data structure (which can be nested and combined and …) to make it SHM-sharable.
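To make that concrete, here is a rough before/after sketch. I'm using boost.interprocess's allocator and offset_ptr as stand-ins; Flow-IPC supplies its own SHM-friendly allocator and fancy-pointer types, whose exact names I'm glossing over here:

    #include <boost/interprocess/managed_shared_memory.hpp>
    #include <boost/interprocess/allocators/allocator.hpp>
    #include <boost/container/vector.hpp>
    #include <memory>

    namespace bip = boost::interprocess;
    using Seg_mgr = bip::managed_shared_memory::segment_manager;
    template <typename T> using Shm_alloc = bip::allocator<T, Seg_mgr>;

    // Before: heap-only; not shareable across processes as-is.
    struct Node_heap {
      Node_heap* next;                       // raw pointer
      boost::container::vector<int> values;  // default (heap) allocator
    };

    // After: SHM-shareable.  Raw pointers become the allocator's fancy-pointer
    // type; containers take the allocator as a template argument.  Nothing else
    // about the structure changes.
    struct Node_shm {
      using Alloc = Shm_alloc<Node_shm>;
      using Ptr   = std::allocator_traits<Alloc>::pointer;  // offset_ptr<Node_shm> under the hood
      Ptr next{};
      boost::container::vector<int, Shm_alloc<int>> values;
      explicit Node_shm(const Shm_alloc<int>& a) : values(a) {}
    };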

We in the library “just” have to provide the SHM-friendly Allocator<T> for the user to use. And, since stateful allocators are essentially unusable by mere humans in my subjective opinion (boost.interprocess authors apparently disagree), we use a particular trick to work with an individual SHM arena: an “Activator” API.

So that leaves the mere topic of this SHM-friendly fancy-pointer type, which we provide.

For SHM-classic mode (if you’re cool with one SHM arena = one SHM segment and both sides being able to write to SHM; and boost.interprocess alloc algorithm) -- enabled with a template arg switch when setting up your session object -- that’s just good ol’ offset_ptr.

For SHM-jemalloc (which leverages jemalloc, and hence is multi-segment and cool like that, plus with better segregation/safety between the sides) internally there are multiple SHM-segments, so offset_ptr is insufficient. Hence we wrote a fancy-pointer for the allocator, which encodes the SHM segment ID and offset within the 64 bits. That sounds haxory and hardcore, but it’s not so bad really. BUT! It needs to also be able to point outside SHM (e.g., into the stack, which is often used when locally building up a structure), so it needs to be able to encode an actually-raw vaddr also. And still use 64 bits, not more. Soooo I used pointer tagging, as not all 64 bits of a vaddr carry information.
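An illustrative sketch of that kind of encoding (this is not Flow-IPC's actual bit layout; it just assumes that on typical x86-64/Linux, user-space canonical addresses fit in the low 48 bits, so the upper bits are free for a tag and segment ID):

    #include <cstdint>
    #include <cassert>

    constexpr uint64_t kTagBit   = 1ull << 63;        // 1 => (segment, offset); 0 => raw vaddr
    constexpr uint64_t kOffsMask = (1ull << 47) - 1;  // offset within a SHM segment

    inline uint64_t encode_raw(const void* p) {
      return reinterpret_cast<uint64_t>(p);           // canonical user vaddr leaves the tag bit 0
    }
    inline uint64_t encode_shm(uint16_t seg_id, uint64_t offset) {
      assert(offset <= kOffsMask);
      return kTagBit | (uint64_t(seg_id) << 47) | offset;
    }
    inline bool     is_shm(uint64_t bits)    { return (bits & kTagBit) != 0; }
    inline uint16_t segment(uint64_t bits)   { return uint16_t((bits & ~kTagBit) >> 47); }
    inline uint64_t offset_of(uint64_t bits) { return bits & kOffsMask; }

    // Dereferencing then maps seg_id to that segment's local base vaddr in the
    // current process and adds the offset.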

So that’s how it all works internally. But hopefully to the user none of these details is necessary to understand. Use our allocator when declaring container members. Use allocator’s fancy-pointer type alias (or similar alias, we give ya the aliases conveniently hopefully) when declaring a direct pointer member. And specify which SHM-backing technique you want us to internally use - depending on your safety and allocation perf desires (currently available choices are SHM-classic and SHM-jemalloc).


Hehe, we are also using fancy-pointer in some places :)

We started with mapping the shm to the same address but soon noticed that it was not a good idea. It works until some application has already mapped something to that address. It's good that you did not go that route.

I hoped you had an epiphany and found a nice solution for the raw-pointer problem without the need to change them and we could borrow that idea :) Replacing the raw-pointer with fancy-pointer is indeed much simpler than replacing the whole logic.

Since the raw pointers need to be replaced by fancy-pointers, how do you handle STL containers? Is there a way to replace the pointer type, or some other magic?

Hehe, we have something called 'relative_ptr' which also tracks the segment ID + offset. It is a struct of two uint64_t though. Later on, we needed to condense it to 64 bit to prevent torn writes in our lock-free queue exchange. We went the same route and encoded the segment ID in the upper 16 bits since only 48 bits are used for addressing. It's kind of funny that other devs also converge to similar solutions. We also have something called 'relocatable_ptr'. This one tracks only the offset to itself and is nice to build relocatable structures which can be memcopied as long as the offset points to a place within the copied memory. It's essentially the 'boost::offset_ptr'.

Btw, when you use jemalloc, do you free the memory from a different process than the one which allocated it? We did the same for iceoryx 1 but moved to a submission-queue/completion-queue architecture to reduce complexity in the allocator and free the memory in the same process that did the allocation. With iceoryx2 we also plan to be more dynamic and have ideas to implement multiple allocators with different characteristics. Funnily, jemalloc is also on the table for use-cases where fragmentation is not a big problem. Maybe we can create a common library for shm allocating strategies which can be used for both projects.


Hi again!

> I hoped you had an epiphany and found a nice solution for the raw-pointer problem without the need to change them and we could borrow that idea :)

Well, almost. But alas, I am unable to perform magic in which a vaddr in process 1 means the same thing in process 2, without forcing it to happen by using that mmap() option. And indeed, I am glad we didn't go down that road -- it would have worked within Akamai due to our kernel team being able to do such custom things for us, avoiding any conflict and so on; but this would be brittle and not effectively open-sourceable.

> Since the raw pointers need to be replaced by fancy-pointers, how do you handle STL containers? Is there a way to replace the pointer type, or some other magic?

Yes, through the allocator. An allocator is, at its core, three things. 1, what to execute when asked to allocate? 2, what to execute when asked to deallocate? 3, and this is the relevant part here, what is the pointer type? This used to be an alias `pointer` directly in the allocator type, but nowadays it's done through allocator traits. Point being: an allocator type can have the pointer type just be T*; or it can alias it to a fancy-pointer type. Furthermore, to be STL-compliant, a container type must religiously follow this convention and never rely on T* being the pointer type. Now, in practice, some GNU stdc++ containers are bad-boys and don't follow this; they will break; but happily:

- clang's libc++ are fine;

- boost.container's are fine (and, of course, implement exactly the required API semantics in general... so you can just use 'em);

- any custom-written containers should be written to be fine; for example see our flow::util::Basic_blob which we use as a nailed-down vector<uint8_t> (with various goodies like predictable allocation size behavior and such) for various purposes. That shows how to write such a container that properly follows STL-compliant allocator behavior. (But again, this is not usually something you have to do: the aforementioned containers are delightful and work. I haven't looked into abseil's.)

So that's how. Granted, subtleties don't stop there. After all, there isn't just "one" SHM arena, the way there is just one general heap. So how to specify which SHM-arena to be allocating-in? One, can use a stateful allocator. But that's pain. Two, can use the activator trick we used. It's quite convenient in the end.
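To spell out point (3) from above in code -- a bare-bones sketch, with malloc/free standing in for the real "allocate from the active SHM arena" hooks, and offset_ptr standing in for the fancy-pointer:

    #include <boost/interprocess/offset_ptr.hpp>
    #include <cstddef>
    #include <cstdlib>

    // Stand-ins: a real implementation would carve these out of the active SHM arena.
    inline void* arena_alloc(std::size_t n)       { return std::malloc(n); }
    inline void  arena_free(void* p, std::size_t) { std::free(p); }

    template <typename T>
    struct Shm_allocator {
      using value_type = T;
      using pointer    = boost::interprocess::offset_ptr<T>;  // what allocator_traits<...>::pointer reports

      Shm_allocator() = default;
      template <typename U> Shm_allocator(const Shm_allocator<U>&) {}

      pointer allocate(std::size_t n) {            // (1) what to do when asked to allocate
        return pointer(static_cast<T*>(arena_alloc(n * sizeof(T))));
      }
      void deallocate(pointer p, std::size_t n) {  // (2) ...and when asked to deallocate
        arena_free(p.get(), n * sizeof(T));
      }
      template <typename U> bool operator==(const Shm_allocator<U>&) const { return true; }
      template <typename U> bool operator!=(const Shm_allocator<U>&) const { return false; }
    };

    // An STL-compliant container then stores offset_ptr internally, e.g.:
    //   boost::container::vector<int, Shm_allocator<int>> v;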

> Btw, when you use jemalloc, do you free the memory from a different process than from which you allocate?

No; this was counter to the safety requirements we wanted to keep to, with SHM-jemalloc. We by default don't even turn on writability into a SHM-arena by any process except the one that creates/manages the arena - can't deallocate without writing. Hence there is some internal, async IPC that occurs for borrower-processes: once a shared_ptr<T> group pointing into SHM reaches ref-count 0, behind the scenes (and asynchronously, since deallocating need not happen at any particular time and shouldn't block user threads), it will indicate this fact to the lending-process. Then once all such borrower-processes have done this, and the same has occurred with the original shared_ptr<T> in the lender-process (which allocated in the first place), the deallocation occurs back in the lender-process.

If one chooses to use SHM-classic (which -- I feel compelled to keep restating for some reason, not sure why -- is a compile-time switch for the session or structure, but not some sort of global decision), then it's all simplicity itself (and very quick -- atomic-int-quick). offset_ptr, internally-stored ref-count of owner-processes; once it reaches 0 then whichever process/piece of code caused it, will itself deallocate it.

The idea of its design is that one could plug-in still more SHM-providers instead of SHM-jemalloc or SHM-classic. It should all keep working through the magic of concepts (not formal C++20 ones... it's C++17).

---

Somewhere above you mentioned collaboration. I claim/hope that Flow-IPC is designed in a pragmatic/no-frills way (tried to vaguely imitate boost.interprocess that way) that exposes whichever layer you want to use, publicly. So, to give an example relating to what we are discussing here:

Suppose someone wants to use iceoryx's badass lock-free mega-fast one-microsecond transmission. But, they'd like to use our SHM-jemalloc dealio to transmit a map<string, vector<Crazy_ass_struct_with_more_pointers_why_not>>. I completely assure you I can do the following tomorrow if I wanted:

- Install iceoryx and get it to essentially work, in that I can transmit little constant-size blobs with it at least. Got my mega-fast transmission going.

- Install Flow-IPC and get it working. Got my SHM-magic going.

- In no more than 1 hour I will write a program that uses just the SHM-magic part of Flow-IPC -- none of its actual IPC-transmission itself per se (which I claim itself is pretty good -- but it ain't lock-free custom awesomeness suitable for real-time automobile parts or what-not) -- but uses iceoryx's blob-transmission.

It would just need to ->construct<T>() with Flow-IPC (this gets a shared_ptr<T>); then ->lend_object<T>() (this gets a tiny blob containing an opaque SHM-handle); then use iceoryx to transmit the tiny blob (I would imagine this is the easiest possible thing to do using iceoryx); on the receiver call Flow-IPC ->borrow_object<T>(). This gets the shared_ptr<T> -- just like the original. And that's it. It'll get deallocated once both shared_ptr<T> groups in both processes have reached ref-count 0. A cross-process shared_ptr<T>, if you will. (And it is by the way just a shared_ptr<T>: not some custom type monstrosity. It does have a custom deleter, naturally, but as we know that's not a compile-time decision.)
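In sketch form (the Arena/Blob types, the transport functions, and the exact signatures below are illustrative stand-ins; only the construct/lend_object/borrow_object call names come from the description above):

    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    struct My_struct { std::map<std::string, std::vector<int>> table; };  // imagine SHM-aware allocators here
    struct Blob { std::vector<unsigned char> bytes; };                    // tiny opaque SHM handle

    struct Arena {  // hypothetical facade over the SHM arena
      template <typename T> std::shared_ptr<T> construct();
      template <typename T> Blob lend_object(const std::shared_ptr<T>&);
      template <typename T> std::shared_ptr<T> borrow_object(const Blob&);
    };

    void transport_send(const Blob&);  // e.g., an iceoryx publisher of small fixed-size blobs
    Blob transport_receive();          // e.g., the matching subscriber

    void sender(Arena& arena) {
      auto obj = arena.construct<My_struct>();  // shared_ptr<My_struct>; storage lives in SHM
      obj->table["key"] = {1, 2, 3};            // build it up like any native structure
      transport_send(arena.lend_object(obj));   // ship only the tiny handle, never the data
    }

    void receiver(Arena& arena) {
      auto obj = arena.borrow_object<My_struct>(transport_receive());
      // Zero-copy: obj points at the same SHM bytes the sender built.  Deallocation
      // happens (via the custom deleter) once both processes' shared_ptr groups
      // reach ref-count zero.
    }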

So yes, believe it or not, I was not trying to out-compete you all here. There is zero doubt you're very good at what you do. The most natural use cases for the two overlap but are hardly the same. Live and let live, I say.


Don't worry. It's great to have other projects in this field, exploring different routes and you created a great piece of software. The best thing, it's all open source after all :)

Reading your response is almost as if you've been at our coffee chats. Quite a few of your ideas are also either already implemented in iceoryx2 or on our todo list. It seems we just put our focus on different things. Here and there you also added the cherry on top. This motivates us to improve in some areas we neglected over the last years. We can learn from each other and improve our projects thanks to the beauty of open source.

Keep up the good work


Whoa. I’m the lead developer on this - I got to this post totally by accident: was googling for my own Show HN post about this from a couple days ago - and it took me here without my noticing.

There’s some discussion on it in Show HN, and of course I can answer anything here that people might be interested in too. I’m very proud of it and very grateful Akamai gave the resources to open-source it.

I’d like to have a flashier friendlier site with a slick intro video - haven’t had the time to do that stuff - but the substance and API documentation + Manual are very serious and complete, I hope.

All linked off the blog-post!


That's great! Show HNs are preferred so I've re-upped your original post and have merged the other thread (https://news.ycombinator.com/item?id=40000104) hither.

But readers will probably want to look at the other article as well: https://www.linode.com/blog/open-source/flow-ipc-introductio....


Whoa (again), super-cool, appreciate that.


Your submission is [dead], dunno why. https://news.ycombinator.com/item?id=39987732



Serialization is the trivial part; the hard part is building a lockfree mpmc queue or message bus (depending on what you want) on top of fixed-size pre-allocated memory segments.

I can't tell what this library does; the blog articles and readme all talk about stuff that isn't close to any of the challenges that I see.


While I wouldn't dream of claiming Flow-IPC will fit every IPCer's priorities, nor of trying to change yours or anyone's, nor of debating about what is trivial versus hard -- it should at least be easily possible to know what's in Flow-IPC. I'm here to help; this is the API overview with various code-snippet synopses, etc.:

https://flow-ipc.github.io/doc/flow-ipc/versions/main/genera...

I should also note that Flow-IPC does not provide "serialization"; it does however enable the use of an existing/best serializer (capnp) for zero-copy messaging. This is only one feature -- albeit oft requested, hence my decision to base the blog/README example on it. (I'm currently also looking into extending this to capnp-RPC.)

But, of course, we don't expect it to match what everyone is looking for; in your case IceOryx might be more your speed -- have a look.


It's so hard to communicate this stuff in writing! There are several angles of potential interest; I wish I could simply chat in-person with anyone curious, you know? Of course that is impossible. (I'll do my best here at HN and the Flow-IPC Discussions board at GitHub.)

I hope the above 2 links get the job done in communicating the key points. There is certainly no shortage of documentation! Still:

If you'll indulge me, I do want to share how this project got started and became open-source. I actually do suspect this might help one get a feeling of what this thing is, and is not.

My name is Yuri Goldfeld. I have worked at Akamai since 2005 (with a break for startup shenanigans, and VMware, in the middle). I designed or co-designed Flow-IPC and wrote about 75% of it (by lines of code ignoring comments); my colleague Eddy Chan wrote the rest, including the bulk of the SHM-jemalloc component (which is really cool IMO).

Akamai in certain core parts is a C++/Linux shop, with dogged scrutiny of latency. Every millisecond along the request path is scrutinized. A few years ago I was asked to do a couple of things:

- Determine the best serializer to use, in general, but especially for IPC protocols. The answer there was easy IMO: Cap'n Proto.

- Split up a certain important C++ service into several parts, for various reasons, without adding latency to the request path.

The latter task meant, among other things, communicating large amounts of user data from server application to server application. capnp-encoded structures (sometimes big - but not necessarily) would also need to be transmitted; as would FDs.

The technical answers to these challenges are not necessarily rocket science. FDs can be transmitted via Unix domain socket as "ancillary data"; the POSIX `sendmsg()` API is hairy but usable. Small messages can be transmitted via Unix domain socket, or pipe, or POSIX MQ (etc.). Large blobs of data would not be okay to transmit via those transports, as too much copying into and out of kernel buffers is involved and would add major latency, so we'd have to use shared memory (SHM). Certainly a hairy technology... but again, doable. And as for capnp - well - you "just" code a `MessageBuilder` implementation that allocates segments in SHM instead of the regular heap like `capnp::MallocMessageBuilder` does.
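For the curious, the "hairy but usable" FD-passing bit boils down to something like this (one FD over a connected Unix domain socket; error handling elided):

    #include <sys/socket.h>
    #include <cstring>

    bool send_fd(int sock, int fd_to_send) {
      char byte = 0;  // must send at least one byte of ordinary data
      iovec iov{&byte, 1};
      alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))]{};

      msghdr msg{};
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = ctrl;
      msg.msg_controllen = sizeof ctrl;

      cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type  = SCM_RIGHTS;  // "this ancillary data carries file descriptors"
      cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
      std::memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

      return sendmsg(sock, &msg, 0) == 1;  // receiver uses recvmsg() and pulls the new FD out of CMSG_DATA
    }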

Thing is, I noticed that various parts of the company had similar needs. I've observed some variation of each of the aforementioned tasks custom-implemented - again, and again, and again. None of these implementations could really be reused anywhere else. Most of them ran into the same problems - none of which is that big a deal on its own, but together (and across projects) it more than adds up. To coders it's annoying. And to the business, it's expensive!

Plus, at least one thing actually proved to be technically quite hard. Sharing (via SHM) a native C++ structure involving STL containers and/or raw pointers: downright tough to achieve in a general way. At least with Boost.interprocess (https://www.boost.org/doc/libs/1_84_0/doc/html/interprocess....) - which is really quite thoughtful - one can accomplish a lot... but even then, there are key limitations, in terms of safety and ease of use/reusability. (I'm being a bit vague here... trying to keep the length under control.)

So, I decided to not just design/code an "IPC thing" for that original key C++ service I was being asked to split... but rather one that could be used as a general toolkit, for any C++ applications. Originally we named it Akamai-IPC, then renamed it Flow-IPC.

As a result of that origin story, Flow-IPC is... hmmm... meat-and-potatoes, pragmatic. It is not a "framework." It does not replace or compete with gRPC. (It can, instead, speed RPC frameworks up by providing the zero-copy transmission substrate.) I hope that it is neither niche nor high-maintenance.

To wit: If you merely want to send some binary-blob messages and/or FDs, it'll do that - and make it easier by letting you set-up a single session between the 2 processes, instead of making you worry about socket names and cleanup. (But, that's optional! If you simply want to set up a Unix domain socket yourself, you can.) If you want to add structured messaging, it supports Cap'n Proto - as noted - and right out of the box it'll be zero-copy end-to-end. That is, it'll do all the SHM stuff without a single `shm_open()` or `mmap()` or `ftruncate()` on your part. And if you want to customize how that all works, those layers and concepts are formally available to you. (No need to modify Flow-IPC yourself: just implement certain concepts and plug them in, at compile-time.)

Lastly, for those who want to work with native C++ data directly in SHM, it'll simplify setup/cleanup considerably compared to what's typical. For the original Akamai service in question, we needed to use SHM as intensively as one typically uses the regular heap. So in particular Boost.interprocess's built-in 2 SHM-allocation algorithms were not sufficient. We needed something more industrial-strength. So we adapted jemalloc (https://jemalloc.net/) to work in SHM, and worked that into Flow-IPC as a standard available feature. (jemalloc powers FreeBSD and big parts of Meta.) So jemalloc's anti-fragmentation algorithms, thread caching - all that stuff - will work for our SHM allocations.

Having accepted this basic plan - develop a reusable IPC library that handled the above oft-repeated needs - Eddy Chan joined and especially heavily contributed on the jemalloc aspects. A couple years later we had it ready for internal Akamai use. All throughout we kept it general - not Akamai-specific (and certainly not specific to that original C++ service that started it all off) - and personally I felt it was a very natural candidate for open-source.

To my delight, once I announced it internally, the immediate reaction from higher-up was, "you should open-source it." Not only that, we were given the resources and goodwill to actually do it. I have learned that it's not easy to make something like this presentable publicly, even having developed it with that in mind. (BTW it is about 69k lines of code, 92k lines of comments, excluding the Manual.)

So, that's what happened. We wrote a thing useful for various teams internally at Akamai - and then Akamai decided we should share it with the world. That's how open-source thrives, we figured.

On a personal level, of course it would be gratifying if others found it useful and/or themselves contributed. What a cool feeling that would be! After working with exemplary open-source stuff like capnp, it'd be amazing to offer even a fraction of that usefulness. But, we don't gain from "market share." It really is just there to be useful. So we hope it is!


That's an impressive read, thank you and congrats on the release! I think that nowadays the development and adoption of performant IPC mechanisms is unfairly low, it's good to have such tech opensourced.

My question is, how does Flow-IPC compare to projects like Mojo IPC (from Chromium) and Eclipse iceoryx? At first glance they all pursue similar goals and pay much less attention to complex allocation management, yet managing to perform well enough.


Appreciate your time! And, naturally, this was the question I expected to pop up once I was able to work through everything required internally here at Akamai to actually put this guy out in public. Wouldn't it be sad :-( if the same thing already existed, and we just hadn't noticed it?

In tactical terms, back when this all started, of course we looked around for something to use; after all why write a whole thing, if we could use something? We didn't write a serializer, for example, since a kick-butt one (capnp - and FlatBuffers also seems fine) already existed. Back then, though, nothing really jumped out. So looking back, it may have simply been a race; a few people/groups out there saw this niche and started developing things. I see iceoryx in particular has one identical plank, which is workable/general end-to-end zero-copy via SHM; and it was released a couple years before, hence has a super nice presentation I hugely appreciate: many well-documented examples in particular. Whereas for us, providing that will take some more effort. (That said, we did not skimp on documentation: everything is documented meticulously, and there is a hopefully-reader-friendly Manual as well.)

When it came down to the core abilities we needed, it was like this: 1. We wanted to be able to share arbitrary combinations of C++ native structures, and not just PoDs (plain-old-datatypes). Meaning, things with pointers needed to work; and things with STL-compliant containers needed to work. Boost.interprocess initially looked like it got that job done... but not enough for our use-case at least. When it came down to it, with Boost.ipc:

- Allocation from a SHM-segment had to be done using a built-in Boost-written heap-allocation algorithm (they provided two of them, and you can plug in your own... as long as all the control structures lived inside SHM).

- The shared data structure had to live entirely within one SHM-segment (mmap()ed area).

But, we needed some heavy-duty allocation - the Boost ones are not that. Plugging in a commercial-grade one - like jemalloc - was an option, but that was itself quite a project, especially since the control structures have to live in SHM for it to work. jemalloc is the most advanced thing available, but it kept control structures as globals, so plopping those into SHM meant changing jemalloc (a lot... Eddy actually did pursue this during the design phase). Plus, having both sides of the conversation reading and writing in one shared SHM-segment was not great due to safety concerns.

And, whatever allocation would be used - with Boost.interprocess's straightforward assumptions - had to be constrained to one mmap()ed area (SHM-segment). jemalloc (for example; substitute tcmalloc or any other heap-provider as desired) would want to mmap() new segments at will. Boost.ipc doesn't work in that advanced way.

2. We wanted to send capnp-encoded messages (and, more generally, just "stuff" - linear buffers) with end-to-end zero-copy, meaning capnp-segments would need to be allocated in SHM. I spoke with Kenton Varda (Cap'n Proto overlord) very recently; he too felt this straightforward desire of not piping-over copies of capnp-encoded items. Various Akamai teams implemented and reimplemented this by hand, for specific use cases, but as I said earlier, it wasn't reusable in a general way (not for our specific use-case for that original big C++ service that I was tasked with splitting-up).

Other niceties were desirable too - not worrying about IPC-resource names/conflicts/..., ensuring SHM cleanup straightforwardly on exit or crash - but they were more tangential (albeit extremely useful) things that came about once we decided to handle the core (1) and (2) in reusable fashion.

At that point, nothing seemed to be around that would just give us those fairly intuitive things. I am not saying these are necessary for every IPC use-case... but they never hurt at the very least, and having them readily available gives one a feeling of power and freedom.

Now, as to the actual question: How does it compare to those? I am not going to lie (because lying is bad): It'll take me a few days to understand the ins and outs of Mojo IPC and iceoryx, so any impression I give here is going to be preliminary and surface-level. To that point, I expect the correct/true answer to your question will be a matter of diving into each API and simply seeing which one seems best to the particular potential user. For Flow-IPC, this Manual page here should be a pretty decent overview of what's available with code snippets: https://flow-ipc.github.io/doc/flow-ipc/versions/main/genera...

That said, my preliminary initial impression is:

(cont.)


Versus iceoryx (the C++ version, not the Rust-oriented iceoryx2):

TL;DR: So far, it looks super-sweet (as well as mature, already supporting macOS for example). However more of an investment to use than is Flow-IPC, with a central daemon and a special event-loop model. It also doesn't want to do #1 above described (no pointers, no using existing STL-compliant container types).

This guy seems really cool, and it directly addresses at least the major part of need #2 above. You can transmit buffers with near-zero latency, and it'll do the SHM stuff for you. (For capnp specifically one would then implement the required SHM-allocating capnp::MessageBuilder, and off we go. Flow-IPC does give you this part out-of-the-box, granted.) Looking over the examples and overview, it seems like integrating it into an event loop might involve some pretty serious learning of iceoryx's event-loop model + subscribe/publish. There is also a central daemon that needs to run.
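Re the MessageBuilder aside just above, the shape of that is roughly the following -- a sketch only, with shm_arena_alloc() as a hypothetical hook into whatever SHM arena you're using (the ready-made Flow-IPC builders presumably do more, e.g. tracking and transmitting the segment list):

    #include <capnp/message.h>
    #include <cstddef>

    void* shm_arena_alloc(std::size_t bytes);  // hypothetical: returns zeroed memory inside a SHM arena

    class Shm_message_builder final : public capnp::MessageBuilder {
    public:
      kj::ArrayPtr<capnp::word> allocateSegment(unsigned int minimumSize) override {
        // capnp asks for `minimumSize` words (8 bytes each) and requires the
        // memory to be zeroed; MallocMessageBuilder would calloc() here instead.
        auto* words = static_cast<capnp::word*>(shm_arena_alloc(minimumSize * sizeof(capnp::word)));
        return kj::ArrayPtr<capnp::word>(words, minimumSize);
      }
    };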

Flow-IPC, to me, seems to have a lower-learning/lower-maintenance curve approach to this. There's no central daemon or any equivalent of it. For each asynchronous thing (a transport::Channel, for example, which has receive-oriented methods), you can use one of 2 supplied APIs. The sync_io-style API will let you plug into anything select()/poll()/epoll()-oriented (and has a syntactic-sugar hook for boost.asio loops). If you've got an event loop, it'll be easy to plug Flow-IPC ops right into it - no background threads added thereby. Or, use the async-I/O-style API; then it'll create background threads as needed and call your callback (e.g., on message receipt) from there, leaving it to you to handle it there or by posting the "true" handling onto one of your own threads.

Point being, my impression so far is, using Flow-IPC in this sense is a lower-effort enterprise. It's pretty much just there to plug-in. (I really hope that isn't slander. That's my take so far - as I said, it'll take me a few days to understand these products in-depth.)

Now, in terms of need #1. (I acknowledge, this need is not for every C++ IPC use-case ever. 2 processes collaborating on one native C++ data structure full of SHM-compliant containers and/or pointers =/= done every day. Still, though, if 2 threads in one process can do it easily, why shouldn't they as-easily be able to do it across a process boundary? Right?) If I understand iceoryx's example on this topic (https://iceoryx.io/latest/examples/complexdata/)... I quote: "To implement zero-copy data transfer we use a shared memory approach. This requires that every data structure needs to be entirely contained in the shared memory and must not internally use pointers or references. ... Therefore, most of the STL types cannot be used, but we reimplemented some constructs. This example shows how to send/receive a iox::cxx::vector and how to send/receive a complex data structure containing some of our STL container surrogates."

With Flow-IPC, this does not apply. You can share existing STL-compliant containers, and (if you want) can have raw pointers too. We have tests nesting boost::container string/vector/map guys and our own flow::util::Basic_blob STL-compliant guy and sharing them, no problem. We've provided the necessary allocator and fancy-pointer types. Moreover, with a single line you can do this in jemalloc-allocated SHM; or instead choose a Boost.ipc-backed single-segment SHM. (Depends on what you desire for safety versus simplicity, internally. I am being a bit vague on that here, but it's in the docs, I promise.) I believe this is a pretty good illustration of Flow-IPC's "thing":

- Meat-and-potatoes: do what you want to do in your daily C++, without a major learning curve...

- ...but without sacrificing essential power...

- ...and extensibly, meaning you can modify its behavior in core ways without requiring a massive amount of learning of how Flow-IPC is built.

Versus Mojo IPC:

I really need to understand it better, before I can really comment. So far, it seems like its equivalent of Flow-IPC's sessions = super cool, building up a network of processes that can all talk to each other once in the network. Flow-IPC's sessions are basic: you want process A and B to speak, you establish a session (during this step, one is designated as the session-server and can therefore accept more sessions from that app or other apps)... then from there, you can make channels (and access SHM arenas, if you are using SHM directly as opposed to letting the zero-copy channels do it invisibly). It also has various-language bindings; Flow-IPC is C++... straight up.

That established, I need to understand it better. It looks like it provides super-fast low-level IPC transports (similar to Flow-IPC's unstructured-layer channels) in platform-agnostic fashion - but does not seem to specifically facilitate end-to-end zero-copy transmission of data structures via SHM. I could be completely wrong here, but it actually looks like one could feasibly plug-in Mojo IPC pipes as Flow-IPC Blob_sender/receiver (and/or Native_handle_sender/receiver) concept impl, into Flow-IPC, and get the end-to-end zero-copy goodness.

At least superficially, so far, Flow-IPC again looks like perhaps a more down-to-earth/readily-pluggable effort. (But, still documented out-the-wazoo!)


I am one of the maintainers of iceoryx and the creator of iceoryx2, so I wanted to add and complete some more details.

iceoryx/iceoryx2 was intended for safety-critical systems initially but now expands to all other domains. In safety-critical systems that run, for instance, in cars or planes, you do not want to have undefined behavior - but the STL is full of it, so we had to reimplement an STL subset (https://github.com/eclipse-iceoryx/iceoryx/tree/master/iceor...) that does not use the heap or exceptions and comes without undefined behavior. So you can send vectors or strings via iceoryx, but you have to use our STL implementations.

It also comes with a service-oriented architecture; you can create a service - identified by name - and communicate via publish-subscribe, request-response, and direct events (and in the planning: pipeline or blackboard).

One major thing is iceoryx robustness. In safety-critical systems, we have a term called freedom-of-interference, meaning that a crash in application A does not affect application B. When they communicate via shared memory, for instance, and use a mutex, they could dead-lock each other when one app dies while holding the mutex. This is why we go for lock-free algorithms here that are tested meticulously, and we are also planning a formal verification of those lock-free constructs.

iceoryx2 is the next-gen of iceoryx where we refactored the architecture heavily to make it more modular and address all the major pain points:

* no longer requires a central daemon and has decentralized all the management tasks, so you get the same behavior without the daemon

* comes with events that can be either based on an underlying fd-event (slower but can be integrated with OS event-multiplexing), or you can choose the fast semaphore route (it is now up to the user)

Currently, we are also working on language bindings for C, C++, Python, Lua, Swift, C#, etc.


Thanks for the detailed answer! I really appreciate that.


You’re welcome. But I must tell you, at work I asked how my answers are, keep me honest. So a coworker looked at this thread and was just like, “dude just get to the point, no one wants to read all that.” And then explained that in huge detail.

That’s just how I talk. With all the writing I’ve had to do lately - documentation, blog, announcements - it’s been a constant struggle forcing myself to say fewer words, keep it short, keep the eyeballs, come onnnnnnn, edit edit edit!!! And that’s good… it’s how it should be. It’s just totally unnatural to me personally… hehe.

FINALLY there’s a chance to simply talk about it to some humans, so I uh… maybe went a little wild with the verbosity.


Actually, I'm happy you spent all those words, and I read them all.

I've been looking for an SHM IPC for a very long time and not finding one. It's nice to know that I'm not the only idiot thinking along these lines.

In addition, it's also nice to know that this was hard. I have taken several stabs at doing this, and I always bounced off thinking "It can't be this difficult. I'm screwing up." Seeing that smart people working for a real company had to do major surgery on something like jemalloc is a bit of a validation.

Can't say I'm happy to see this in C++, but I'll take what I can get. :)

Thanks to all the folks who wrote it. And thank you for the long winded explanations otherwise I probably would have ignored it.


1. You’re welcome!

2. I too would like to explore other developments - not that I think C++ sucks or anything (not that I don’t think it doesn’t suck sometimes!) - Rust and all. Here’s hoping

3. At points during development of this, the ol’ impostor syndrome would kick in. “Surely someone would’ve done this already.” Or more often, “Meh, people will just roll their own version of this, it’s not that hard.” But then I’d actually go through the exercise of implementing whatever it was - and think to myself, “that WAS NOT obvious.” It dawned on me that by doing it, I proved (to myself at least) that it’s not easy to do it, and thus perhaps worth having started.


Thank you so much for posting everything that you did. Long-form details are hard to find. In my truest contribution, I suggest you're a pleasure to work with. Rarely have I seen anyone willing to give the deep-digest of their determination and problems with a function. Consider writing more!

EDIT

I didn't see the other posts about brevity. Fuck that. Details matter and so does the human experience. Who hasn't on this site been the unfortunate recipient of trying to get some brilliant but shitty function to work?


I’ve spent a lot of time with boost asio and serialisation of objects into a boost variant to send that across the wire. The server visits the variant to process the message. Including boost shared memory for file data.

Both for unix domain sockets and TCP.

There’re plenty of boost examples around, so I’d suggest you take their examples and rework them for your framework.

As I’m sure you’re aware, a clean and easy to read example will make a difference.

It’s great that you’re open source and I hope you get some traction.


Indeed, examples from every angle are probably the one deficit of the existing documentation. There are a couple, such as the perf_demo described in the blog post. I’d like to add ones showing integration with

- epoll based event loop

- boost.asio based event loop

(Boost.interprocess and boost.asio are huge inspirations and are both used inside!)

As for traction: it’s tough! Have to get eyeballs; and then have to convey a sense of being worth one’s trust.

Thank you for your time.


Integration with boost asio would be of interest to many - myself included. It is the de facto standard for anyone who’s got past Stevens' Unix Network Programming.

It would gain a level of trust with developers.


Roger dodger.

For what it is worth at this time - obviously acting on the following statement will require some level of trust -

It is very much ready to use with boost.asio. (I know that, because I myself use boost.asio religiously. If it were not compatible with it, I'd pretty much have to not use Flow-IPC myself.) Though, it could (fairly easily) gain a number of wrapper classes that would turn our stuff into actual boost.asio I/O objects; then it'd be even more straightforward.

Topic is covered here:

https://flow-ipc.github.io/doc/flow-ipc/versions/main/genera...

There's even the little section entitled, "I'm a boost.asio user. Can't I just give your constructor my io_context, and then you'll place the completion handler directly onto it?"

To summarize, though...

-1- You can have Flow-IPC create background threads as-needed and ping your completion handler (e.g., "message received") from such threads.

-2- You can have it not create any background threads, instead asking you to .async_wait() (via boost.asio, most easily; but also manually with poll() or whatever you want) whenever it needs internally to async-await something. Your own completion handler (e.g., handle just-received message M) shall execute synchronously at only predictable points, in non-blocking fashion.

-3- Direct integration with boost.asio - meaning ipc::transport::Channel (e.g.) would take an io_context/executor/whatever in its ctor, and .async_X(F) would indeed post F onto that io_context/executor/whatever = essentially syntactic sugar = a TODO. (I'd best file an Issue, I just remembered.)

The perf_demo (partially recreated in the blog-post) integrates into a single-threaded boost.asio io_context, using technique #2 above. In the source code snippets in the blog, we avoided anything asynchronous just to keep it focused for the max # of readers (hopefully).
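To illustrate technique #2 in generic terms on Linux (the names below are hypothetical stand-ins, not the actual API; the pattern is just "the library hands you an FD to watch, you watch it in your own boost.asio loop and call back in"):

    #include <boost/asio.hpp>
    #include <functional>
    #include <memory>

    namespace asio = boost::asio;

    // resume_ipc_op stands in for "let the IPC library do its non-blocking work
    // now that the FD is ready"; raw_fd is the FD the library asked us to watch.
    void watch_ipc_fd(asio::io_context& io, int raw_fd, std::function<void()> resume_ipc_op) {
      auto desc = std::make_shared<asio::posix::stream_descriptor>(io, raw_fd);  // real code would dup() first
      desc->async_wait(asio::posix::stream_descriptor::wait_read,
                       [desc, resume_ipc_op](const boost::system::error_code& ec) {
        if (!ec) {
          // The library reads what it can; if a full message arrived, the user's
          // handler fires synchronously, right here on the io_context thread --
          // no background threads introduced.
          resume_ipc_op();
        }
      });
    }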


Top tip: ensure your ASIO code is not exported from a shared library.

I’ve been hit by Cephfs using some version and my own code using another.

The fixes were simple though.

Edit: as for performance, I’d not focus on that too much. It’ll depend on circumstances the end user has. Myself, I’d measure the interfaces with stack based timings and dump to a JSON file at exit. Graphs under various loads and an a/b comparison.

As an example, on a dedupe system I measure LZO was better for performance than LZ4. HPE rack units with spinning rust disks.

Edit 2: I’ve forwarded your GitHub to my work account. I’ll offer the research to a colleague (Jira backlog) to look at when “someone” wants our new system to be faster. We have a boost asio solution I wrote that works - local unix domain sockets. Hitachi NAS.


I have done something exactly identical at my current place of employment, and am always curious to see how others have 'stacked-da-cat'.

We _unfortunately_ gravitated towards protobufs despite my fervent appeal to go with Cap'n Proto. That has caused a cascade of troubles / missed opportunities for optimizations etc. etc.


I tried to migrate to Cap'n Proto but it just doesn't build on MinGW, so I have no choice but to wait. Like you say, it gets worse the longer I wait. But if the APIs are somewhat sane, they should hopefully also be somewhat similar: able to switch-case on oneofs, movable data structures, etc.

I don't like that protobuf has recently started linking with abseil, which, despite being a good framework, I can't use if it doesn't build absolutely everywhere I need it to. So maybe I'll be forced over to Cap'n Proto one of these days?


Honestly I like Cap'n Proto. But its community is a bit light. Even Windows, which is "supported", has been a bit fix-it-yourself.

If MinGW isn't already supported, I'd expect you'll probably have to take that on yourself.


I've also developed a strikingly similar low latency real-time IPC message bus for work. It also uses sockets with transparent shared memory optimization. In my case, it's the backbone for an autonomous aircraft's avionics. I made everything agnostic to the message scheme, though, and most of the tooling supports an in-house schema, protobuf, JSON, YAML, etc. There's also clients implemented in C++, Rust, Python and Julia.

What troubles has protobuf caused you?


> What troubles has protobuf caused you?

we are primarily a c++ shop, so all comments are from that vantage point. In no particular order:

     1. extremely bloated generated code.

     2. wire format is not optimal, specifically for large nested messages, which trashes the cache on already severely constrained h/w.

     3. not easy to perform shared-memory exchange for large messages, for example, if you have 'string' types in your messages, those cannot be allocated on arenas.

     4. <this space is for rent>


At the risk of almost-spamming -- this post has taken off, which is sweet, and I have noticed some trends in what seems to interest readers that may not be placed as prominently up-top as would fit this audience. To wit: the API Overview from the Manual covers the available stuff, with code snippets (and some words). Could save people some time:

https://flow-ipc.github.io/doc/flow-ipc/versions/main/genera...


> Currently Flow-IPC is for Linux

Dang. I was excited for a brief moment, but support for macOS + Windows is mandatory for all of my use cases.

To be honest what I actually want is NOT "the fastest possible thing". All I actually care about is "easy advertisement, discovery, and message sending". I use localhost TCP way more than I want because it "just works".

Maybe someday I'll stumble across my dream IPC library.


Oooh, so close. We’ve got the advertisement/discovery and messaging for sure.

Concretely what it would take to port it to those OS: https://github.com/Flow-IPC/ipc/issues/101

Given a couple weeks to work on it, this thing would be on macOS no problem. With Windows I personally need to understand its FD-passing and native handle concepts first, but I’m guessing it’d be a similar amount of effort in the end.


I thought the same when we ported iceoryx to macOS and Windows. macOS is pretty straightforward, except it does not support unnamed semaphores. But Windows is an entirely different story. For one, it supports only streaming Unix domain sockets, and it didn't support the transfer of FDs when we ported iceoryx to Windows back then - maybe it supports it now. Also, when you want to perform some access control with access rights, you have to face sid- and ace-strings - oh, they are fun. And, of course, there are all the nasty details; for instance, Windows defines macros that lead to compilation failures since they collide with internal naming. Take a look at this here; maybe it makes your efforts less painful: https://github.com/eclipse-iceoryx/iceoryx/blob/master/iceor...

You could reuse the iceoryx platform layer that enables iceoryx to run on every platform from qnx, linux, freertos, mac, windows. Maybe it can help you as well: https://github.com/eclipse-iceoryx/iceoryx/blob/master/doc/w...


(Akamai owns Linode and uses the blog on Linode.com as a developer-oriented blog. So that’s why the link is to there.)


Cool stuff!

Does Flow-IPC protect against malformed messages? For example a client sending malformed messages to a server process


Given that it's shared memory based, it seems like there has to be some degree of trust that the participants are well behaved. What do you mean by a malformed message, though? If you're talking about the payload of the message, that seems like a matter of the message scheme you're using. If you're talking about correctness of the IPC protocol itself, integrity checking is unfortunately at odds with latency.


Reply co-signed!

That said, I'll add that my Manual (https://flow-ipc.github.io/doc/flow-ipc/versions/main/genera...) gets into the topic of safety (with a few words on how this differs from security; this indeed involves trust). So indeed Flow-IPC does not attempt to provide security for malicious/untrusted conversation-partner -- but it does provide a few enabled-by-default safety mechanisms. The Manual page in question: https://flow-ipc.github.io/doc/flow-ipc/versions/main/genera...

Among other things, the capnp-layer (which, as I noted in a recent reply earlier, is optional to use -- you can and sometimes certainly should "just" go native, and/or combine the two approaches) uses an internally-generated token to ensure some kind of terrible bug/thing/problem/act-of-god isn't mis-routing messages. It's a safety mechanism.

But in terms of, say, incompatible schemas or mis-matching native `struct`s -- what if you used different compilers with slightly different STLs on the 2 sides?! -- it is indeed on the developer to not make that error/not converse with untrusted code. Flow-IPC will guard against misuses to the extent it's reasonable though, IMO.

P.S. [Internal impl note] Oh! And, although at the moment it is the protocol version v1 (for all internally-involved protocols at all layers), I did build-in a protocol-version-checking system from the start, so as to avoid shooting ourselves in the foot, in case Flow-IPC needs to expand some of its internal protocol(s) in later versions. At the very worst, Flow-IPC would refuse to establish a channel/session upon encountering a partner with Flow-IPC version such that their protocols are incompatible. (Again -- academic at the moment, since there is only v1 -- but might be different in the future. At that point a new protocol might be developed to be backward-compatible with earlier-Flow-IPCs and just still work; or worst case throw the aforementioned error if not.)


Do you have any concrete plans about a potential network extension yet?


A couple -

1. The obvious one is “just” extending stuff internally working via Unix domain sockets to TCP sockets. Various internal code is written with an eye to that, including anticipating that certain operations (such as connect) that are instant locally can would-block in a network.

If people enjoy the API, this would be a no-brainer value-add, even if lots of people would scoff and use actual dedicated networking techniques (HTTP, whatever) directly instead.

2. The much more fun and unique idea is using RDMA, “sort of” a networked-SHM type of setup (internally). Hope to get a go-ahead (or contribution, of course) on this.

I mention these in the intro page of the Manual, I think.



