We’re hitting a classic distributed systems wall and I’m looking for war stories or "least worst" practices.
The Context: We maintain a distributed stateful engine (think search/analytics). The architecture is standard: a Control Plane (Coordinator) assigns data segments to Worker Nodes. The workload involves heavy use of mmap and lazy loading for large datasets.
The Incident: We had a cascading failure where the Coordinator got stuck in a loop, DDoS-ing a specific node.
The Signal: Coordinator sees Node A has significantly fewer rows (logical count) than the cluster average. It flags Node A as "underutilized."
The Action: Coordinator attempts to rebalance/load new segments onto Node A.
The Reality: Node A is actually sitting at 197GB RAM usage (near OOM). The data on it happens to be extremely wide (fat rows, huge blobs), so its logical row count is low, but physical footprint is massive.
The Loop: Node A rejects the load (or times out). The Coordinator ignores the backpressure, sees the low row count again, and retries immediately.
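To make the loop concrete: independent of how we score nodes, the coordinator needs to remember rejections instead of re-deriving "underutilized" from the same stale signal. A toy sketch of a per-node rejection cooldown with exponential backoff (all names and thresholds here are made up, not our actual code):

```python
class PlacementGuard:
    """Tracks per-node rejections so the coordinator stops
    hammering a node that is pushing back. Hypothetical sketch:
    a node that said "no" leaves the candidate set for a while,
    no matter how underutilized its row count looks."""

    def __init__(self, base_cooldown_s: float = 5.0, max_cooldown_s: float = 300.0):
        self.base = base_cooldown_s
        self.max = max_cooldown_s
        self.strikes = {}        # node_id -> consecutive rejections
        self.blocked_until = {}  # node_id -> monotonic deadline

    def record_rejection(self, node_id: str, now: float) -> None:
        # Exponential backoff: 5s, 10s, 20s, ... capped at max.
        strikes = self.strikes.get(node_id, 0) + 1
        self.strikes[node_id] = strikes
        cooldown = min(self.base * (2 ** (strikes - 1)), self.max)
        self.blocked_until[node_id] = now + cooldown

    def record_success(self, node_id: str) -> None:
        self.strikes.pop(node_id, None)
        self.blocked_until.pop(node_id, None)

    def eligible(self, node_id: str, now: float) -> bool:
        return now >= self.blocked_until.get(node_id, 0.0)
```

This doesn't fix the scoring function, but it breaks the tight retry loop: the "low row count" signal can't override a fresh 429/timeout.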
The Core Problem: We are trying to write a "God Equation" for our load balancer. We started with row_count, which failed. We looked at disk usage, but that doesn't correlate with RAM because of lazy loading.
Now we are staring at mmap. Because the OS manages the page cache, the application-level RSS is noisy and doesn't strictly reflect "required" memory vs "reclaimable" cache.
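On Linux with cgroup v2, memory.stat at least splits anonymous memory from page cache, which gets closer to "required vs reclaimable" than raw RSS does. A toy parser showing the idea (the split heuristic is an assumption on our side, and PSI via memory.pressure is probably a better pressure signal than any absolute number):

```python
def memory_pressure(stat_text: str) -> dict:
    """Split cgroup-v2 memory.stat output into 'required'
    (anonymous + unevictable) vs 'reclaimable' (page cache +
    reclaimable slab) bytes. Heuristic sketch only: mmap'd file
    pages land in 'file', which the kernel can drop under
    pressure, so they shouldn't count against placement."""
    fields = {}
    for line in stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            fields[key] = int(value)
    required = fields.get("anon", 0) + fields.get("unevictable", 0)
    reclaimable = fields.get("file", 0) + fields.get("slab_reclaimable", 0)
    return {"required": required, "reclaimable": reclaimable}
```

A coordinator scheduling on `required` instead of RSS would not have seen Node A as "underutilized" in the first place.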
The Question: Trying to fold every resource variable (CPU, IOPS, RSS, disk, logical row count) into a single scoring function feels like a trap; every variable we add creates a new pathological case.
How do you handle placement in systems where memory usage is opaque/dynamic?
Dumb Coordinator, Smart Nodes: Should we just let the Coordinator blind-fire based on disk space, and rely 100% on the Node to return hard 429 Too Many Requests based on local pressure?
Cost Estimation: Do we try to build a synthetic "cost model" per segment (e.g., predicted memory footprint) and schedule based on credits, ignoring actual OS metrics?
Control Plane Decoupling: Separate storage balancing (disk) from query balancing (mem)?
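To make options 1 and 2 concrete, here's roughly the shape we're imagining for a node-side admission gate: the node owns the truth about its memory, checks a predicted segment footprint against local non-reclaimable usage, and the coordinator only has to honor the 429. Class name, thresholds, and the prediction input are all hypothetical:

```python
class AdmissionController:
    """Node-local gate for 'dumb coordinator, smart nodes'.
    Sketch only: real admission would also watch PSI and
    in-flight loads, not just a static budget."""

    def __init__(self, ram_limit_bytes: int, headroom: float = 0.15):
        self.limit = ram_limit_bytes
        self.headroom = headroom  # reserve for query spikes

    def admit(self, predicted_segment_bytes: int, required_bytes: int):
        # 'required_bytes' means anonymous/unevictable memory,
        # not raw RSS, so reclaimable mmap cache doesn't cause
        # spurious rejections.
        budget = self.limit * (1.0 - self.headroom)
        if required_bytes + predicted_segment_bytes > budget:
            return 429, "memory pressure: retry after rebalance"
        return 200, "accepted"
```

The cost-model idea (option 2) hides inside `predicted_segment_bytes`: even a crude per-segment footprint estimate beats the logical row count, because the node corrects the estimate with a hard refusal when it's wrong.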
Feels like we are reinventing the wheel. References to papers or similar architecture post-mortems appreciated.
mmap() is one of those things that looks like an easy solution when you start writing an application, but only because you don't yet see the complexity time bomb you're taking on.
The entire point of the various ways of performing asynchronous disk I/O using APIs like io_uring is to manage when and where blocking of tasks for I/O occurs. When you know where blocking I/O gets done, you can make it part of your main event loop.
If you don't know when or where blocking occurs (be it on I/O or mutexes or other such things), you're forced to make up for it by increasing the size of your thread pool. But larger thread pools come with a penalty: task switches are expensive! Scheduling is expensive! The AVX-512 registers alone are 2KB of state per task, and if a thread hasn't run for a while, you're probably missing in your L1 and L2 caches. That's pure overhead baked into the thread-pool architecture, and you can avoid it entirely with an event-driven architecture.
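As a toy illustration of the shape (Python asyncio rather than io_uring, but the same idea: one hot thread multiplexing every wait, with each blocking point made explicit as an await):

```python
import asyncio

async def handle_request(i: int, results: list) -> None:
    # Each await is an explicit yield point: the single event-loop
    # thread parks this task and runs another, with no kernel
    # context switch and no per-thread register/stack state.
    await asyncio.sleep(0)  # stands in for a non-blocking disk/net read
    results.append(i)

async def main() -> int:
    results: list = []
    # Ten thousand concurrent "requests" on one thread; a
    # thread-per-request design would pay a scheduler and cache
    # penalty for the same concurrency.
    await asyncio.gather(*(handle_request(i, results) for i in range(10_000)))
    return len(results)

print(asyncio.run(main()))  # → 10000
```

io_uring takes this further by making the I/O submission and completion themselves asynchronous at the kernel boundary, but the scheduling argument is the same.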
All the high-performance systems I've worked on use event-driven architectures -- from network protocol implementations (BGP and the HA functionality on JunOS) to high-speed persistent and non-persistent messaging (at Solace). It just makes everything easier when you can keep threads hot and locked to a single core. Bonus: at maximum load you stay at pretty much the same requests per second, rather than degrading as the number of runnable threads grows and wastes CPU exactly when you need it most.
It's hard to believe that the event queue architecture I first encountered on an Amiga in the late 1980s when I was just a kid is still worth knowing today.