The cpu still need to guarantee forward progress. The store buffer is always flushed as fast as possible (i.e. as soon as the cpu can acquire the cacheline in exclusive mode, which is guaranteed to happen in a finite time). I can't point you to the exact working in the intel docs as they are quite messy, but you can implement a perfectly C++11 compliant SPSC queue purely with load and stores on x86 without any fences or #LOCK operations.
What fences (and atomic RMWs) do on intel is to act as synchronization, preventing subsequent reads to complete before any store preceding the fence. This was originally implemented simply by stalling the pipeline, but these days I suspect the loads have an implicit dependency on a dummy store on the store buffer representing the fence (possibly reusing the alias detection machinery?).
> I can't point you to the exact working in the intel docs as they are quite messy, but you can implement a perfectly C++11 compliant SPSC queue purely with load and stores on x86 without any fences or #LOCK operations.
I would disagree. I think the Intel docs do not specify a guarantee of flushing, and so if the SP and SC are on different cores, I think then in principle (but not in practise) the SC could in fact never see what the SP emits.
What fences (and atomic RMWs) do on intel is to act as synchronization, preventing subsequent reads to complete before any store preceding the fence. This was originally implemented simply by stalling the pipeline, but these days I suspect the loads have an implicit dependency on a dummy store on the store buffer representing the fence (possibly reusing the alias detection machinery?).