The short version is that if atomics are implemented as part of the memory network, common cache, or memory controller, then atomics of the form “fetch-and-X” can be implemented in roughly equivalent complexity to a load of the current value (plus an instruction for the op, give it take) with the cost only scaling past that as op queues or other implementation-specific limits fire. It’s the infinite consensus ops that just can’t scale no matter what you do. The coherence and memory model matter a lot too of course, which is part of why x86 tends to be slow for atomics, while arm and ppc (with fetch-and-X extensions) or GPUs tend to do much better.
Interesting--can you link a reference for this?