Great question! This has been top of mind for me for the last 2–3 years.
Short answer: sadly, no. I love the "usability" promise of coroutines—and even have 2–3 FOSS projects that could be rewritten entirely around C++ or Rust coroutines for better debuggability and extensibility—but my experiments show that the runtime cost of most coroutine‑like abstractions is simply too high. Frankly, I’m not even sure if a better design is possible on modern hardware.
This leads me to conclude that, despite my passion for SIMD and superscalar execution, the highest‑impact new assembly instructions that x86 and Arm could standardize would center on async execution and lightweight context switching... yet I haven’t seen any movement in that direction.
⸻
I also wrote toy examples for various range/async/stream models in C++, Rust, and Python, with measured latencies in inline comments:
Aside from coroutines (both toy hand-rolled implementations and commonly used libraries), I've also played around with C++ executors and senders & receivers, but didn't have much success with them either. Maybe it's a skill issue.
> my experiments show that the runtime cost of most coroutine‑like abstractions is simply too high
Which runtime cost do you mean?
The main one I am aware of is a heap allocation per coroutine, though in some cases this can be elided if the coroutine is called from another coroutine (the HALO optimization).
The other cost I am aware of is initializing the coroutine handle, but I think this is just a couple of pointers.
In both cases I would expect these overheads to be relatively modest compared to the cost of the I/O itself, though it's definitely better to elide the heap allocation when possible.
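To make the allocation cost visible, here's a minimal C++20 sketch (my own illustration, not from any library mentioned above): overriding `operator new` on the promise type intercepts the frame allocation, so you can see when the compiler does or doesn't elide it at a given optimization level.

```cpp
#include <coroutine>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Minimal eagerly-started task type. The coroutine frame (promise +
// locals + bookkeeping) is what gets heap-allocated; instrumenting
// operator new on the promise makes that allocation observable.
struct task {
    struct promise_type {
        // Every coroutine frame for this task type goes through here,
        // unless the compiler elides the allocation entirely (HALO).
        void* operator new(std::size_t n) {
            std::printf("frame allocation: %zu bytes\n", n);
            return std::malloc(n);
        }
        void operator delete(void* p) { std::free(p); }

        task get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
};

task hello() {
    co_return;  // trivial body: frame is mostly promise + bookkeeping
}

int main() { hello(); }
```

Whether the print fires depends on inlining and optimization flags, which is exactly the fragility of relying on HALO in hot paths.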
I don't know much about coroutine libraries like unifex (which I think your test is using), but a hand-coded prototype I was playing with doesn't seem to add much overhead: https://godbolt.org/z/8Kc1oKf15
My exploration into coroutines and I/O is only in the earliest stages, so I won't claim any of this to be definitive. But I am very interested in this question of whether the overhead is low enough to be a good match for io_uring or not.
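For anyone curious what the wiring looks like, here's the basic shape of suspending a coroutine until an io_uring read completes. This is a toy sketch using liburing (not the linked prototype); error handling, cancellation, and a real scheduler are all omitted, and the `task`/`read_awaitable` names are just illustrative.

```cpp
#include <coroutine>
#include <cstdio>
#include <liburing.h>
#include <unistd.h>

// Fire-and-forget task type so the coroutine below can run detached.
struct task {
    struct promise_type {
        task get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
};

// Awaitable that submits a read SQE and suspends; the completion loop
// resumes the coroutine once the CQE arrives.
struct read_awaitable {
    io_uring* ring;
    int fd;
    void* buf;
    unsigned len;
    int result = 0;
    std::coroutine_handle<> handle;

    bool await_ready() const noexcept { return false; }
    void await_suspend(std::coroutine_handle<> h) {
        handle = h;
        io_uring_sqe* sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, fd, buf, len, 0);
        io_uring_sqe_set_data(sqe, this);  // completion loop finds us here
        io_uring_submit(ring);
    }
    int await_resume() const noexcept { return result; }
};

task reader(io_uring* ring, int fd) {
    char buf[64];
    int n = co_await read_awaitable{ring, fd, buf, sizeof buf};
    std::printf("read %d bytes: %.*s\n", n, n, buf);
}

int main() {
    io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    int p[2];
    pipe(p);
    reader(&ring, p[0]);           // suspends inside co_await
    write(p[1], "hello", 5);       // make the pending read completable

    io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);  // single completion, single resume
    auto* aw = static_cast<read_awaitable*>(io_uring_cqe_get_data(cqe));
    aw->result = cqe->res;
    io_uring_cqe_seen(&ring, cqe);
    aw->handle.resume();

    io_uring_queue_exit(&ring);
}
```

Compile with `-std=c++20 -luring`. Note the per-operation cost here is an SQE fill plus a suspend/resume; the question is how that compares to the syscall and I/O it wraps.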
The cost of a context switch consists of two parts, one of which can be subdivided:
1. register save/restore
1a. user-space only registers
1b. full register save/restore, required in kernel space
2. the cost of the TLB flush and the TLB misses that follow it, which is proportional to the working set size of the switched-to process (i.e. if you don't touch much memory after the context switch, the cost is lower than if you do)
I am not sure that any assembler instructions could address either of these.
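For anyone who wants to put a number on this, the classic estimate is a pipe ping-pong between two processes, which forces a context switch on every read. A rough sketch (my own, not a rigorous benchmark): it mostly captures part (1) plus scheduler overhead; the TLB component in (2) only shows up if each side touches a large working set between switches. Pinning both processes to one core (taskset or sched_setaffinity) is assumed but omitted.

```cpp
#include <chrono>
#include <cstdio>
#include <unistd.h>

// Two processes bounce a byte through a pair of pipes; each read blocks,
// so every round trip costs (at least) two context switches.
int main() {
    constexpr int iters = 100000;
    int ping[2], pong[2];
    pipe(ping);
    pipe(pong);

    if (fork() == 0) {                 // child: echo each byte back
        char c;
        for (int i = 0; i < iters; ++i) {
            read(ping[0], &c, 1);
            write(pong[1], &c, 1);
        }
        _exit(0);
    }

    char c = 'x';
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {  // parent: send, wait for echo
        write(ping[1], &c, 1);
        read(pong[0], &c, 1);
    }
    auto t1 = std::chrono::steady_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("round trip: %ld ns (~%ld ns per switch)\n",
                (long)(ns / iters), (long)(ns / iters / 2));
    return 0;
}
```

The pipe syscalls themselves are part of what's measured, so treat the per-switch figure as an upper bound rather than a pure register-save/restore cost.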
Does C++ have a good async ('coroutine') story for io_uring yet?