I'm not sure about how atomics having the performance of loads mean they will sc...

trws · on June 18, 2023

Then why keep bouncing it? Leaving it managed by a single known owner in a cache architecture like x86's means that per-operation latency goes up, but the overall throughput of the operation goes up drastically. That’s why flat-combining data structures are so popular there, even though their absolute maximum throughput is bounded by sequential performance.
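For anyone who hasn't seen the pattern, here is a rough sketch of the idea. It is a simplified illustration, not the published flat-combining algorithm: it uses a fixed array of per-thread slots instead of a dynamic publication list, and the names (`Slot`, `CombiningCounter`, `run_demo`) and the 64-byte cache-line size are my own assumptions. Each thread publishes its request in its own slot, and whichever thread grabs the lock applies everyone's pending requests, so the shared value stays with one core for a whole batch instead of bouncing on every operation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// One publication slot per thread. The 64-byte alignment is an assumed
// cache-line size, so publishing in one slot does not disturb its neighbours.
struct alignas(64) Slot {
    std::atomic<bool> pending{false};
    std::int64_t delta = 0;
};

// Hypothetical counter that batches updates through a single combiner.
class CombiningCounter {
public:
    explicit CombiningCounter(std::size_t threads) : slots_(threads) {}

    void add(std::size_t tid, std::int64_t delta) {
        Slot& mine = slots_[tid];
        mine.delta = delta;
        mine.pending.store(true, std::memory_order_release);

        // Wait until some combiner (possibly this thread) applies the request.
        while (mine.pending.load(std::memory_order_acquire)) {
            if (lock_.try_lock()) {
                // This thread is the combiner: apply every pending request
                // while value_ stays hot in this core's cache.
                for (Slot& s : slots_) {
                    if (s.pending.load(std::memory_order_acquire)) {
                        value_ += s.delta;
                        s.pending.store(false, std::memory_order_release);
                    }
                }
                lock_.unlock();
            } else {
                std::this_thread::yield();
            }
        }
    }

    std::int64_t value() {
        std::lock_guard<std::mutex> guard(lock_);
        return value_;
    }

private:
    std::vector<Slot> slots_;
    std::mutex lock_;
    std::int64_t value_ = 0;  // only touched while holding lock_
};

// Spawns `threads` workers that each add 1 `ops_per_thread` times.
std::int64_t run_demo(std::size_t threads, int ops_per_thread) {
    CombiningCounter counter(threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, t, ops_per_thread] {
            for (int i = 0; i < ops_per_thread; ++i) counter.add(t, 1);
        });
    }
    for (auto& w : workers) w.join();
    return counter.value();
}
```

Calling `run_demo(4, 1000)` returns 4000. Each individual `add` may wait longer than a plain `fetch_add` would, which is the latency cost mentioned above; the hoped-for win is that the combiner handles a batch without the line migrating between cores. Whether that wins in practice depends on the contention level and the hardware.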

Also, FWIW, Intel largely does implement operations closer to that way on single-socket parts. If you want to see it for real, look at on-device atomics on a GPU. Ironically, an average laptop chip handles atomics much faster than most servers as a result.