The short version is that if atomics are implemented as part of the memory network, shared cache, or memory controller, then atomics of the form "fetch-and-X" can be implemented in roughly the same complexity as a load of the current value (plus an instruction for the op, give or take), with the cost only scaling beyond that once op queues or other implementation-specific limits kick in. It's the ops with infinite consensus number (compare-and-swap and the like) that just can't scale no matter what you do. The coherence and memory model matter a lot too, of course, which is part of why x86 tends to be slow for atomics, while arm and ppc (with fetch-and-X extensions) or GPUs tend to do much better.
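To make the distinction concrete, here's a minimal C++ sketch of my own (an illustration, not anyone's actual implementation): a fetch-and-X increment that the memory system can resolve in a single pass, versus a CAS-based increment whose retry loop does unbounded work under contention.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

// fetch-and-X: one request that the cache/memory controller can apply where
// the line lives and return the old value -- no retry loop needed.
uint64_t bump_fetch_add() {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// CAS-based update (an infinite-consensus-number op): read, compute, try to
// swap; under contention the compare_exchange can keep failing, so the cost
// is unbounded rather than roughly "a load plus an op".
uint64_t bump_cas() {
    uint64_t old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_relaxed)) {
        // on failure, 'old' is refreshed with the current value; retry
    }
    return old;
}
```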