The short version is that if atomics are implemented as part of the memory network, shared cache, or memory controller, then atomics of the form "fetch-and-X" can be implemented in roughly the same complexity as a load of the current value (plus an instruction for the op, give or take), with the cost only scaling beyond that once op queues or other implementation-specific limits kick in. It's the ops with infinite consensus number (compare-and-swap and the like) that just can't scale no matter what you do. The coherence and memory model matter a lot too, of course, which is part of why x86 tends to be slow for atomics, while arm and ppc (with fetch-and-X extensions) or GPUs tend to do much better.
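To make the distinction concrete, here's a minimal C++ sketch of my own (an illustration, not anyone's actual implementation): a fetch-and-X increment that the memory system can resolve in a single pass, versus a CAS-based increment whose retry loop does unbounded work under contention.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

// fetch-and-X: one request that the cache/memory controller can apply where
// the line lives and return the old value -- no retry loop needed.
uint64_t bump_fetch_add() {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// CAS-based update (an infinite-consensus-number op): read, compute, try to
// swap; under contention the compare_exchange can keep failing, so the cost
// is unbounded rather than roughly "a load plus an op".
uint64_t bump_cas() {
    uint64_t old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_relaxed)) {
        // on failure, 'old' is refreshed with the current value; retry
    }
    return old;
}
```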