
> See how we added a new fence() call to put()? That fixes the reordering problem we’ve described at length. Now if the CPU gets bored waiting for entries[i] to be read into the cache, and tries to pull up the tail++ line so it happens sooner, bonk!, the tail++ line hits that fence and stops moving. We’ve forced the entry to be written before the tail is bumped. Problem solved!

I may be completely wrong, it's a complicated subject, but I think the wording here is potentially misleading.

As I understand it (and I may be wrong!), a full fence does the following: "all reads and all writes prior to the fence must occur before any reads or writes after the fence".

What I'm concerned about is people thinking this is a blocking behaviour, i.e. when we hit the fence, then at that point all prior reads and writes occur.

This is not the case.

The reads and writes prior to the fence can occur at any time - they could occur LONG AFTER we pass the fence - but what we do get from the fence is that the reads and writes prior to the fence WILL occur BEFORE any reads or writes after the fence.

So, to put it another way, the code is executing, we come to the fence - and absolutely nothing happens. The processor just keeps going. No reads occur, no writes occur.

Then at some point later on, both in time and in code, the process comes to another (say) read. NOW, finally, the processor MUST complete all reads and writes which were issued prior to the fence.
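To make that concrete, here's a minimal sketch (my own, not the article's actual code; all names are illustrative) of a ring-buffer put() in C++11. The fence does nothing at the point it executes - it only constrains the order in which the entry write and the tail bump become visible:

    #include <atomic>
    #include <cstddef>

    constexpr std::size_t SIZE = 1024;             // illustrative capacity
    int entries[SIZE];
    std::atomic<std::size_t> head{0}, tail{0};

    bool put(int value) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == SIZE)
            return false;                                       // queue is full
        entries[t % SIZE] = value;                              // write the entry...
        std::atomic_thread_fence(std::memory_order_release);    // the new fence()
        tail.store(t + 1, std::memory_order_relaxed);           // ...then bump the tail
        return true;
    }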




A thread which performs a write, has a fence, and then performs another write is guaranteed, due to the fence, that the first write has completed by the time of the second write.

However, "completed" is misleading.

The write will still not be seen by readers.

"Complete" really means "the writer has done all the work he can do, which is necessary but insufficient".

For a reader to see the write, the readers must issue a read memory barrier before reading the value.

All of this is due to store buffers and cache invalidation requests.

In short, in principle, a write is known about only by the processor it occurred upon; if there are any other processors in the system, then unless special magic is performed (fences and so on), any writes they see from other processors are seen purely by chance and can happen in any order (including going backwards in time), or not happen at all.
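To make the pairing concrete, here's a minimal sketch in C++11 (names are mine, purely illustrative): the writer's fence alone is not enough; the reader has to order its own loads too.

    #include <atomic>

    int payload = 0;                      // ordinary, non-atomic data
    std::atomic<bool> ready{false};

    void writer() {
        payload = 42;                                           // 1. write the data
        std::atomic_thread_fence(std::memory_order_release);    // 2. write fence
        ready.store(true, std::memory_order_relaxed);           // 3. publish the flag
    }

    void reader() {
        while (!ready.load(std::memory_order_relaxed)) { }      // 1. see the flag
        std::atomic_thread_fence(std::memory_order_acquire);    // 2. read fence
        int v = payload;                                        // 3. now guaranteed to be 42
        (void)v;
    }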

I once read an article which framed all this in terms of source control.

Memory is like SVN, or Git.

You make local changes (on your processor, which has its local copy of memory). You then commit (write fence/atomic operation). No one else can see your changes until they update from source control and get the latest version (read fence).


> You make local changes (on your processor, which has its local copy of memory). You then commit (write fence/atomic operation). No one else can see your changes until they update from source control and get the latest version (read fence).

That's actually not really true; what you are describing is a non-coherent system.

On a cc system, once a store is no longer speculative, it is guaranteed to be flushed out of the store buffer into the cache, and a load that reaches the cache layer is guaranteed to see the last version of the data that was stored in that memory location by any coherent agent in the system.

As pointed out elsethread, you need load barriers specifically to order your loads, so they are needed for ordering, not visibility.

The way that barriers contribute to visibility (and the reason that they need to be paired) is by giving conditional guarantees: T1: S.1; #StoreStore; S.2. T2: L.2; #LoadLoad; L.1. If T2 observes, from its first load at memory location 2, the value stored by S.2, then it is guaranteed that L.1 will observe the value stored by S.1. So cache coherency plus purely local load and store barriers give you the global Acquire/Release model.
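A sketch of that litmus test in C++11 terms (X, Y, r1, r2 are illustrative names), with the conditional guarantee spelled out:

    #include <atomic>

    std::atomic<int> X{0}, Y{0};

    void t1() {
        X.store(1, std::memory_order_relaxed);                 // S.1
        std::atomic_thread_fence(std::memory_order_release);   // ~ #StoreStore
        Y.store(1, std::memory_order_relaxed);                 // S.2
    }

    void t2() {
        int r2 = Y.load(std::memory_order_relaxed);            // L.2
        std::atomic_thread_fence(std::memory_order_acquire);   // ~ #LoadLoad
        int r1 = X.load(std::memory_order_relaxed);            // L.1
        // Conditional guarantee: if r2 == 1 (L.2 saw S.2), then r1 == 1 (L.1 sees S.1).
        // If r2 == 0, nothing at all is implied about r1.
        (void)r1; (void)r2;
    }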


> On a cc system, once a store is no longer speculative, it is guaranteed to be flushed out of the store buffer into the cache,

That's the thing - I may be wrong, but I'm under the impression store buffers do not guarantee anything. I'm pretty certain they do not on Intel, for example. All you read in the docs is "flushed in a reasonable time", which can mean anything.

> and a load that reaches the cache layer is guaranteed to see the last version of the data that was stored in that memory location by any coherent agent in the system.

Yes.

> As pointed out elsethread, you need load barriers specifically to order your loads, so they are needed for ordering, not visibility.

Mmm - again, I may be wrong, but I think there are also no guarantees about how cache invalidation requests are handled. Given no guarantee, then in theory an invalidation request is never handled (unless you force it to be with a read barrier).

When I need a set of data to be written (say a struct - something bigger than you can manage in an atomic op), I'll write to the struct, fence, then perform a throw-away atomic op. The atomic op forces a write (a normal write could just be deferred and not force completion of pre-barrier writes) and then I know my struct has gone out past the store buffer and has reached cache control.
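Roughly this pattern, as a sketch (the names are made up; the "drains the store buffer" behaviour of a locked RMW is the x86-specific property being relied on):

    #include <atomic>

    struct Config { int a, b, c; };      // illustrative struct

    Config shared_config;                // written here, read elsewhere
    std::atomic<int> drain_dummy{0};     // target of the throw-away atomic op

    void publish_config(int a, int b, int c) {
        shared_config.a = a;             // plain writes into the struct
        shared_config.b = b;
        shared_config.c = c;
        std::atomic_thread_fence(std::memory_order_seq_cst);   // the fence
        drain_dummy.fetch_add(1, std::memory_order_seq_cst);   // throw-away atomic op
        // On x86 the locked RMW cannot retire until earlier stores have drained
        // from the store buffer, so the struct has reached cache control by now.
    }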


The cpu still needs to guarantee forward progress. The store buffer is always flushed as fast as possible (i.e. as soon as the cpu can acquire the cacheline in exclusive mode, which is guaranteed to happen in finite time). I can't point you to the exact wording in the intel docs as they are quite messy, but you can implement a perfectly C++11-compliant SPSC queue purely with loads and stores on x86 without any fences or #LOCK operations.

What fences (and atomic RMWs) do on intel is act as synchronization, preventing subsequent reads from completing before any store preceding the fence. This was originally implemented simply by stalling the pipeline, but these days I suspect the loads have an implicit dependency on a dummy store in the store buffer representing the fence (possibly reusing the alias detection machinery?).
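For reference, a minimal sketch of the kind of queue I mean (illustrative names, power-of-two capacity assumed): on x86 the acquire loads and release stores below compile to plain MOVs, with no fences or LOCKed instructions.

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>      // N must be a power of two
    class SpscQueue {
        T buf_[N];
        std::atomic<std::size_t> head_{0};    // consumer index
        std::atomic<std::size_t> tail_{0};    // producer index
    public:
        bool push(const T& v) {               // called by the single producer
            std::size_t t = tail_.load(std::memory_order_relaxed);
            if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
            buf_[t % N] = v;
            tail_.store(t + 1, std::memory_order_release);    // plain MOV on x86
            return true;
        }
        bool pop(T& out) {                    // called by the single consumer
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire)) return false;      // empty
            out = buf_[h % N];
            head_.store(h + 1, std::memory_order_release);    // plain MOV on x86
            return true;
        }
    };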


> I can't point you to the exact wording in the intel docs as they are quite messy, but you can implement a perfectly C++11-compliant SPSC queue purely with loads and stores on x86 without any fences or #LOCK operations.

I would disagree. I think the Intel docs do not specify a guarantee of flushing, and so if the SP and SC are on different cores, then in principle (but not in practice) the SC could in fact never see what the SP emits.


You do not need a load barrier to observe stores from other cores. You need the load barrier so your own core does not reorder loads; otherwise you could see wacky loads even if the stores are totally ordered.

As an example, suppose we have totally ordered stores, but loads can be reordered freely. Suppose X is 0 and Y is 0, and we store X to 1 and then store Y to 1. Therefore, Y can only be 1 after X is 1. Now suppose the reader loads Y and then loads X in program order: if it sees Y is 1, it should also see X is 1. But if loads can go out of order, the core can load X before either store (seeing 0) and load Y after both stores (seeing 1), so the program observes Y is 1 and X is 0 - an outcome the store ordering alone can never produce.
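A small litmus sketch of that in C++11 (illustrative names): with relaxed loads the reader's two loads may be reordered and the "impossible" outcome is allowed; making the first load an acquire - a load barrier, in effect - rules it out.

    #include <atomic>

    std::atomic<int> X{0}, Y{0};

    void writer() {
        X.store(1, std::memory_order_release);
        Y.store(1, std::memory_order_release);        // Y becomes 1 only after X is 1
    }

    void reader() {
        int r_y = Y.load(std::memory_order_relaxed);  // program order: Y first...
        int r_x = X.load(std::memory_order_relaxed);  // ...then X
        // With relaxed loads, r_y == 1 && r_x == 0 is a permitted outcome (loads
        // reordered). With Y.load(std::memory_order_acquire), it is forbidden.
        (void)r_y; (void)r_x;
    }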


> You do not need a load barrier to observe stores from other cores. You need the load barrier so your own core does not reorder loads; otherwise you could see wacky loads even if the stores are totally ordered.

Yes. I mean, you need the load barrier to see stores safely. Also, in principle, since AFAIK there are no guarantees about cache invalidation request behaviour, in theory an invalidation request could go unhandled forever - in which case you would in fact literally never see the writes from other cores (until you issued a read barrier).


That's more correct, but there are two points of order.

The first is that a cpu core is a massively distributed system, and there is no single point in time when an operation is executed.

The second is that cpus will absolutely physically do reads out of order even when those reads logically have to happen in order. Suppose you read X, then fence, then read Y; the cpu may read Y immediately, begin speculatively executing dataflows that depend on Y, and only later read X. But if, between the time it read Y and the time it read X, Y changed, it will roll back any intermediate computation dependent on Y and try again. If Y didn't change between the time it was initially read and the time X was read, then the result is semantically the same as if Y had been read after X, so there is no problem.


> The first is that a cpu core is a massively distributed system, and there is no single point in time when an operation is executed.

This is critical: there might be hundreds of cycles between an instruction entering the pipeline and exiting it, so you have to be precise about what "ordering" means. Typically the only thing that matters is visible side effects, i.e. access to the coherence layer.



