The strength of a memory model refers to its concurrency guarantees. It also gets deep into caching architectures, out-of-order execution (OoO), and inter-CPU communication.
This is a deep rabbit hole because modern processors are speculatively executing THOUSANDS of instructions ahead. So what you have/haven't written to memory is slightly existential.
Admittedly more than I expected, and thanks for sharing, but "20 CPUs executing a peak of 224 instructions each" is a far cry from "modern processors execute THOUSANDS of instructions AHEAD".
It's neither "thousands" of instructions in flight for a single processor, nor is it thousands "ahead" for multiple processors, let alone both. It's like taking 4000 basic single-stage CPUs and claiming they execute thousands of instructions "ahead", which makes no sense.
Agreed, saying thousands was hyperbole; you do have a point. When talking about instruction windows, one should talk about only one core at a time.
Although the idea of "lots of instructions in flight" was right. From a programmer's point of view there's no difference between out-of-order windows of tens, hundreds, or thousands of instructions. One just needs to be prepared that the CPU can and will reorder things within the limits of the defined memory model, which does not bound the out-of-order window size.
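For illustration, here's a minimal C++ sketch of the classic store-buffering litmus test (thread and variable names are just illustrative). Whether the window holds 20 or 2000 instructions, the observable outcomes are the same:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

void t1() {
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed);  // may be reordered before the store
}

void t2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);  // may be reordered before the store
}

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // r1 == 0 && r2 == 0 is a permitted outcome: each core's store can sit
    // in its store buffer past its own load, regardless of window size.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```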
Because we're talking about cross-core guarantees of concurrency. So you actually do care about what another core, or all cores, are doing.
Which core is loading what, which core is storing what, and what is pending and holding up those loads is actually extremely important from the perspective of atomic guarantees.
Weaker CPUs (like, say, POWER7) that do batched writes (you accumulate 256 bits of data, then write it all at once) don't communicate what is in their write-out buffer. So you may write to a pointer, but until that CPU does a batched write, the other cores aren't aware of it. You have to issue a fence to flush this buffer so the write becomes visible to the other cores.
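In C++ terms, a minimal sketch of that fence (on POWER the release fence typically compiles to lwsync; the writer/reader split is illustrative):

```cpp
#include <atomic>

int data = 0;                      // plain, non-atomic payload
std::atomic<bool> ready{false};

void writer() {
    data = 42;                                            // may linger in a write buffer
    std::atomic_thread_fence(std::memory_order_release);  // typically lwsync on POWER
    ready.store(true, std::memory_order_relaxed);
}

void reader() {
    while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    int v = data;  // guaranteed to read 42
    (void)v;
}
```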
There are some scenarios where the same situation can arise on x64, but it's rarer. The Intel cache architecture attempts to negotiate and detect when you are or aren't sharing data between cores. So for _most_ writes it behaves the same way as POWER7, but if it can predict that you're sharing data it'll use a different bus and alert the other CPU directly.
This is why x64 uses a MESIF-esque cache protocol [1][2]. It can tell when data is Owned/Forwarded/Shared between cores.
> Because we're talking about cross-core guarantees of concurrency. So you actually do care about what another core, or all cores, are doing.
This is mostly about the load/store ordering of an individual core: how an individual core decides to order its reads and writes to memory.
> Weaker CPUs (like, say, POWER7) that do batched writes (you accumulate 256 bits of data, then write it all at once) don't communicate what is in their write-out buffer. So you may write to a pointer, but until that CPU does a batched write, the other cores aren't aware of it. You have to issue a fence to flush this buffer so the write becomes visible to the other cores.
You can actually do the same on x86 by using non-temporal stores. Although there you're not talking about store ordering, but about visibility to other cores. A store won't ever be visible to other cores until it at least reaches the L1 cache controller.
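A minimal sketch of the non-temporal case (function and variable names here are illustrative, not from any particular codebase):

```cpp
#include <emmintrin.h>  // SSE/SSE2 intrinsics: _mm_stream_si32, _mm_sfence
#include <atomic>

void publish(int* payload, std::atomic<int>* ready) {
    _mm_stream_si32(payload, 42);  // movnti: goes to a write-combining buffer,
                                   // bypassing the cache, invisible to other cores
    _mm_sfence();                  // drain WC buffers; orders the NT store before the flag
    ready->store(1, std::memory_order_release);
}
```

Without the sfence, even x86's strong ordering doesn't apply: non-temporal stores aren't ordered with the subsequent flag store, so another core could see the flag before the payload.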
> There are some scenarios where the same situation can arise on x64, but it's rarer.
Yup, that's right. That's why x86 (and x64) got mfence and sfence instructions.
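Sketching the fix for the store-buffering litmus test upthread: a full fence between each thread's store and load (a seq_cst fence compiles to mfence on x86):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};

int t1_fenced() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence on x86
    return y.load(std::memory_order_relaxed);
}

int t2_fenced() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence on x86
    return x.load(std::memory_order_relaxed);
}
// With the fences in place, t1_fenced() and t2_fenced() can no longer
// both return 0: the fence drains the store buffer before the load.
```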
> This is why x64 uses a MESIF-esque cache protocol [1][2]. It can tell when data is Owned/Forwarded/Shared between cores.
Reordering happens before the cache controller. By the time the cache controller is involved, the store is already in progress.