I'm not sure how atomics having the performance of loads would mean they scale (and to be honest I doubt they could have the perf of loads on a modern architecture; otherwise why would e.g. Intel not already implement them to be faster? But let's pretend it is possible).
The fine article shows that a single lock xadd can destroy performance on some x86 systems and explains that this is due to cache line bouncing. You would get the same effect with loads: if the loaded data is read-only or mostly read-only it will of course scale fine. It won't scale as soon as the line starts bouncing too much.
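To make the bouncing concrete, here is a rough C++17 sketch (not taken from the article) that contrasts one shared counter with per-thread counters. The names `count_shared`, `count_sharded` and `PaddedCounter` are made up for illustration, and the 64-byte line size is an assumption that holds on typical x86 parts.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// One counter shared by every thread. On x86 each fetch_add is a locked
// read-modify-write (e.g. lock xadd) that needs the cache line in exclusive
// state, so the line keeps migrating between the cores that touch it.
std::uint64_t count_shared(unsigned threads, std::uint64_t iters) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&counter, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();
    return counter.load();
}

// Assumption: 64-byte cache lines. The alignment gives every counter its
// own line, so no two threads write to the same line.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

// One counter per thread, summed once at the end. The atomic instruction is
// the same; only the sharing pattern differs.
std::uint64_t count_sharded(unsigned threads, std::uint64_t iters) {
    std::vector<PaddedCounter> slots(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&slots, t, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                slots[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();
    std::uint64_t total = 0;
    for (auto& s : slots) total += s.value.load();
    return total;
}
```

Both functions return `threads * iters`. Timing them against each other on a multi-core box is the interesting part, and the gap will depend heavily on the machine, so no numbers are claimed here.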
Then why keep bouncing it? Leaving the line managed by a single known owner in a cache architecture like x86's means that latency goes up, but the overall throughput of the operation goes up drastically. That's why flat-combining data structures are so popular there, despite their absolute maximum throughput being bounded by sequential performance.
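As a rough illustration of the single-owner idea, here is a much simplified flat-combining-style counter in C++17. It is a sketch under assumptions, not the published algorithm (which uses a publication list rather than a fixed slot array), and `Slot`, `CombiningCounter` and `run_combining` are hypothetical names. Each thread publishes its request in its own slot, and whoever holds the lock applies every pending request in one pass.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines, one request slot per thread.
struct alignas(64) Slot {
    std::atomic<std::uint64_t> pending{0};  // nonzero = request not yet applied
};

class CombiningCounter {
public:
    explicit CombiningCounter(unsigned threads) : slots_(threads) {}

    // Thread `id` asks for `delta` (must be nonzero) to be added.
    void add(unsigned id, std::uint64_t delta) {
        slots_[id].pending.store(delta, std::memory_order_release);
        // Wait until some combiner (possibly this thread) applies the request.
        while (slots_[id].pending.load(std::memory_order_acquire) != 0) {
            if (lock_.try_lock()) {
                combine();
                lock_.unlock();
            } else {
                std::this_thread::yield();
            }
        }
    }

    std::uint64_t value() {
        std::lock_guard<std::mutex> guard(lock_);
        return total_;
    }

private:
    // Runs only while holding lock_: apply every pending request in one pass.
    void combine() {
        for (auto& s : slots_) {
            std::uint64_t d = s.pending.load(std::memory_order_acquire);
            if (d != 0) {
                total_ += d;
                s.pending.store(0, std::memory_order_release);
            }
        }
    }

    std::vector<Slot> slots_;
    std::mutex lock_;
    std::uint64_t total_ = 0;  // only ever touched by the current lock holder
};

// Hypothetical driver: `threads` workers each add 1 to the counter `iters` times.
std::uint64_t run_combining(unsigned threads, std::uint64_t iters) {
    CombiningCounter counter(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&counter, t, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) counter.add(t, 1);
        });
    for (auto& th : pool) th.join();
    return counter.value();
}
```

Each individual `add` can take longer than a bare `fetch_add`, since it may wait for a combiner, but the hot state (`total_`) is only touched by whoever holds the lock during a pass. That is the latency-for-throughput trade described above.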
Also, FWIW, Intel largely does implement operations closer to that way on single-socket parts. If you want to see it for real, look at on-device atomics on a GPU. Ironically, an average laptop chip handles atomics much faster than most servers as a result.