The following comes from the complete opposite side of computing, microcontrollers. I've been working on an embedded system where the heap is about 256 KiB and the biggest stack has 4 KiB. I do write idiomatic modern C++ for the most part (even if I hate C++ metaprogramming with a passion), but not all tricks are suitable in all situations:
- CTRE is fine as long as you don't overflow the stack. I once tried to validate a string for an HTTP proxy configuration with an exhaustive regex; CTRE tried to allocate 5 KiB of stack 40 call frames in and crashed the embedded system with a stack overflow. I had to remove port validation from the regex (matching a number between 1 and 65535 was a bridge too far) and check that part by hand instead. I've also had to dumb down other CTRE regexes in my code for similar reasons.
- Several constraints and design decisions led me to mostly ditch JSON internally and write my own BSON library. Instead of the traditional dynamically-allocated tree of nodes, it works directly in-place, so I can give it a std::vector with a chunk of reserved memory upfront and not worry about memory allocation or fragmentation later on. One major benefit is that, since there are no string escape sequences, I can directly return a std::string_view for string values. There are downsides to this approach, mostly revolving around modifications: one needs to be very careful not to invalidate iterators (which are raw pointers into the underlying buffer) while modifying, and adding/removing entries towards the beginning of a large document is expensive due to the memmove().
- I ditched newlib for picolibc and exterminated anything that pulled in the C/C++ standard library locale code (that was over 130 kilobytes of Flash altogether IIRC), which includes among other things C++ streams (they are bad for plenty of other reasons too, but mine was program size).
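As a rough illustration of the in-place approach described above (a toy sketch, not the actual library): walk a BSON buffer and hand back a std::string_view pointing straight into it, with no tree of nodes and no copies. `find_string` is a hypothetical helper that only handles string and int32 elements, and it assumes a little-endian host, as BSON itself is little-endian.

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>
#include <vector>

// Read a little-endian int32 from the buffer (BSON is little-endian).
static std::int32_t read_i32(const char* p) {
    std::int32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Look up a top-level string element by key and return a view straight
// into the buffer -- no copy, no unescaping, precisely because BSON
// stores strings verbatim with an explicit length prefix.
std::optional<std::string_view> find_string(const std::vector<char>& doc,
                                            std::string_view key) {
    const char* p = doc.data();
    const char* end = p + read_i32(p);   // first i32 is the total document size
    p += 4;
    while (p < end - 1) {                // last byte is the 0x00 terminator
        std::uint8_t type = static_cast<std::uint8_t>(*p++);
        std::string_view name(p);        // element name is a NUL-terminated C string
        p += name.size() + 1;
        switch (type) {
        case 0x02: {                     // string: i32 length (incl. NUL) + bytes + NUL
            std::int32_t len = read_i32(p);
            std::string_view value(p + 4, static_cast<std::size_t>(len - 1));
            if (name == key) return value;
            p += 4 + len;
            break;
        }
        case 0x10:                       // int32: skip 4 value bytes
            p += 4;
            break;
        default:
            return std::nullopt;         // other types not handled in this sketch
        }
    }
    return std::nullopt;
}
```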
You seem to have mostly aimed for throughput and raw performance in your benchmarks, which is fine for a generic desktop or server-class system with an MMU and plenty of resources. I just want to point out that other environments have different constraints that mandate different kinds of optimizations: memory usage (heap/stack/program size), dynamic memory fragmentation, real-time/jitter...
I’ve done C++ on a Cortex-M0+ with 8KB of flash. Code size is a big issue. You have to disable a bunch of stuff (no exceptions, nothing that does dynamic allocation) but you can still use classes, virtual methods, templates, constexpr, etc. These are all things that are a pain to do in C and usually require a bunch of gross macros.
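As a small illustration of the constexpr point: a lookup table the compiler computes entirely at build time and places in read-only memory (flash on most MCUs), which C typically handles with an external code generator or a wall of macros. The CRC-8 polynomial 0x07 is just an example choice, not anything from the original post.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Build a CRC-8 lookup table at compile time.
constexpr std::array<std::uint8_t, 256> make_crc8_table() {
    std::array<std::uint8_t, 256> table{};
    for (int i = 0; i < 256; ++i) {
        std::uint8_t crc = static_cast<std::uint8_t>(i);
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 0x80) ? static_cast<std::uint8_t>((crc << 1) ^ 0x07)
                               : static_cast<std::uint8_t>(crc << 1);
        table[i] = crc;
    }
    return table;
}

// Lives in .rodata; computed entirely by the compiler, zero runtime cost.
constexpr auto kCrc8Table = make_crc8_table();

std::uint8_t crc8(const std::uint8_t* data, std::size_t len) {
    std::uint8_t crc = 0;
    while (len--) crc = kCrc8Table[crc ^ *data++];
    return crc;
}
```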
As a former C++ programmer now writing C, I think this is only true for templates, and if you limit yourself somewhat even that is fine. For constexpr it depends on what you use it for: if it's something expensive to compute, I would just run a program at build time (caching the output) and include the result. That seems preferable to me anyhow. The same goes for tests.
Yeah, embedded C++ is a wildly different experience from vanilla. I've worked in large embedded C++ codebases where we couldn't use the STL and had to use homegrown containers for everything.
I wonder how Rust is stacking up (no pun intended) in the embedded game these days.
Very true! I'd also go for similar optimizations when processing texts or sparse linear algebra on Nvidia and AMD GPUs. You only have ~50 KB of constant memory, ~50 MB of shared memory, and ~50 GB of global memory. It is BIG compared to microcontrollers but very little compared to the scope of problems often solved on GPUs. So many optimizations revolve around compressed representations and coalesced memory accesses.
I am still looking for a short example of such CUDA kernels, and I would love to see more embedded examples if you have thoughts ;)
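Not a CUDA kernel, but here's a short CPU-side sketch of the kind of compressed representation such kernels consume: Compressed Sparse Row storage plus the per-row SpMV loop that a GPU would hand to each thread (the names `Csr`, `from_dense`, and `spmv` are illustrative, and the coalescing itself only happens once rows are mapped to threads on the device).

```cpp
#include <cstddef>
#include <vector>

// Compressed Sparse Row: store only the nonzeros plus two index arrays.
struct Csr {
    std::vector<std::size_t> row_ptr; // row i occupies [row_ptr[i], row_ptr[i+1])
    std::vector<std::size_t> cols;    // column index of each nonzero
    std::vector<double> vals;         // nonzero values
};

Csr from_dense(const std::vector<std::vector<double>>& m) {
    Csr a;
    a.row_ptr.push_back(0);
    for (const auto& row : m) {
        for (std::size_t j = 0; j < row.size(); ++j)
            if (row[j] != 0.0) { a.cols.push_back(j); a.vals.push_back(row[j]); }
        a.row_ptr.push_back(a.vals.size());
    }
    return a;
}

// y = A * x, one row at a time. On a GPU this inner loop is the per-thread
// body; laying vals/cols out contiguously is what enables coalesced reads.
std::vector<double> spmv(const Csr& a, const std::vector<double>& x) {
    std::vector<double> y(a.row_ptr.size() - 1, 0.0);
    for (std::size_t i = 0; i + 1 < a.row_ptr.size(); ++i)
        for (std::size_t k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k)
            y[i] += a.vals[k] * x[a.cols[k]];
    return y;
}
```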
I haven't had to reach for them so far, either professionally or personally, but custom memory allocators (slab allocation, bump allocators...) and allocation strategies are something I've been meaning to look into. Too bad that the one game I've done reverse-engineering on used dynamic memory allocation for just about everything, with an allocator based on a singly-linked list of used/free chunks that wouldn't look out of place in the 1980s.
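For reference, a bump allocator is only a few lines. This toy sketch (names made up) hands out memory by advancing a cursor over a fixed buffer; individual frees are impossible, you reset the whole arena at once, which is exactly why there is no fragmentation.

```cpp
#include <cstddef>

// Minimal bump (arena) allocator: allocation is a round-up and an add.
class BumpArena {
public:
    BumpArena(std::byte* buf, std::size_t size) : base_(buf), size_(size) {}

    // `align` must be a power of two; the base buffer itself must be
    // suitably aligned by the caller.
    void* allocate(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (offset_ + align - 1) & ~(align - 1); // round up
        if (p + n > size_) return nullptr;                    // arena exhausted
        offset_ = p + n;
        return base_ + p;
    }

    void reset() { offset_ = 0; }  // "free" everything at once

private:
    std::byte* base_;
    std::size_t size_;
    std::size_t offset_ = 0;
};
```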
I'm aware that the C++ standard library has polymorphic allocators alongside a couple of memory resource implementations. I've also heard that the dynamic dispatch in polymorphic allocators can carry an optimization or speed penalty compared to a statically dispatched allocator or the default std::allocator that uses operator new(), but I have no concrete data to judge either way.
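For what it's worth, the standard pieces look like this: a std::pmr::monotonic_buffer_resource backed by a fixed buffer, with every container allocation dispatching virtually through memory_resource::do_allocate() — that virtual call is the dispatch cost in question. The helper function is just for illustration.

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Sum 0..n-1 using a vector whose storage comes from a fixed stack buffer.
// monotonic_buffer_resource never frees individual blocks (deallocate is a
// no-op); using null_memory_resource() as upstream makes exhaustion throw
// instead of silently falling back to operator new().
int sum_with_arena(int n) {
    std::array<std::byte, 1024> buf;
    std::pmr::monotonic_buffer_resource arena(
        buf.data(), buf.size(), std::pmr::null_memory_resource());

    std::pmr::vector<int> v(&arena);  // every allocation virtual-dispatches
                                      // through memory_resource::do_allocate()
    for (int i = 0; i < n; ++i) v.push_back(i);

    int total = 0;
    for (int x : v) total += x;
    return total;
}
```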
> CTRE is fine as long as you don't overflow the stack
Which is to say CTRE is mostly not fine if you use it on user-provided strings, regardless of target environment. It's heavily recursion-based, never spills to the heap, and has no safeguards on memory use or recursion depth.
Once I gave up fully validating the port number with the regex, it no longer blew up the stack:
^http:\/\/([a-z0-9.-]+)\/?:([1-9][0-9]{0,4})$
I'll admit I haven't done a thorough job of auditing the stack usage afterwards, but not all regexes look like Perl codegolf. For simple, straightforward patterns I don't see any problems using CTRE, but I'd be interested to see some proof to the contrary if you have some.
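For completeness, the by-hand range check that replaced the regex part can be done allocation-free with std::from_chars. The capture group above only guarantees 1 to 5 digits not starting with 0, so the 1..65535 check happens here; `valid_port` is a made-up helper name.

```cpp
#include <charconv>
#include <cstdint>
#include <string_view>

// Validate a digit run as a TCP port. std::from_chars allocates nothing
// and cannot throw, which makes it embedded-friendly.
bool valid_port(std::string_view digits, std::uint16_t& port_out) {
    unsigned value = 0;
    auto [ptr, ec] =
        std::from_chars(digits.data(), digits.data() + digits.size(), value);
    if (ec != std::errc{} || ptr != digits.data() + digits.size())
        return false;                      // non-digits or overflow
    if (value < 1 || value > 65535)
        return false;                      // out of port range
    port_out = static_cast<std::uint16_t>(value);
    return true;
}
```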
Not sure I'd reach for C++ or regexes in such a constrained micro environment. Anything where you don't directly understand the precise memory use is probably out.
The NumWorks N0100 graphical calculator had 1 MiB of Flash and 256 KiB of RAM. It packed seven mathematical apps (calculation, grapher, equations, statistics, regression, sequences, distributions) with a decently powerful maths engine/equation typesetter written in C++ and a MicroPython shell. They paid a fair amount of attention to detail in order to fit all of that in (not least skipping the STL entirely), but C++ wielded correctly for embedded is no more of a memory hog than C.
Our target has ~1.5 MiB of Flash for program code and 512 KiB of RAM. We're using half of the former and maybe a third of the latter, yet the team barely paid any attention to program size or memory consumption. One day the project lead became slightly concerned about that, and by the end of the day I had shaved 20% off both Flash and RAM usage just by going for the lowest-hanging fruit.
I find it a bit amusing to call a 250 MHz STM32H5 MCU a constrained micro environment; if anything, it's a bit overkill for what we need.
> I find it a bit amusing to call a 250 MHz STM32H5 MCU a constrained micro environment; if anything, it's a bit overkill for what we need.
I took an "embedded" systems class in college 15+ years ago that targeted a 32-bit ARM with megabytes of RAM, so using these KiB-of-RAM micros in 2025 definitely feels like a constrained environment to me. The platforms I work on with C++ professionally have, ya know, hundreds of gigabytes of RAM (and our application gets ~100% of it).