The strength of a memory model refers to its concurrency guarantees. It also gets deep into caching architectures, out-of-order execution (OoO), and inter-CPU communication.
This is a deep rabbit hole because modern processors are speculatively executing THOUSANDS of instructions ahead. So what you have/haven't written to memory is slightly existential.
Admittedly more than I expected, and thanks for sharing, but "20 CPUs executing a peak of 224 instructions each" is a far cry from "modern processors execute THOUSANDS of instructions AHEAD".
It's neither "thousands" of instructions in flight for a single processor, nor is it thousands "ahead" for multiple processors, let alone both. It's like taking 4000 basic single-stage CPUs and claiming they execute thousands of instructions "ahead", which makes no sense.
Agreed, saying thousands was hyperbole; you do have a point. When talking about instruction windows, one should talk about only one core at a time.
Although the idea of "lots of instructions in flight" was right. From a programmer's point of view there's no difference between out-of-order windows of tens, hundreds, or thousands of instructions. One just needs to be prepared that the CPU can and will reorder things within the limits of the defined memory model, which does not bound the out-of-order window size.
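For illustration, here's a minimal C++ sketch of the classic store-buffering litmus test (thread and variable names are just illustrative). Whether the window holds 20 or 2000 instructions, the observable outcomes are the same:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

void t1() {
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed);  // may be reordered before the store
}

void t2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);  // may be reordered before the store
}

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // r1 == 0 && r2 == 0 is a permitted outcome: each core's store can sit
    // in its store buffer past its own load, regardless of window size.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```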
Because we're talking about cross-core guarantees of concurrency. So you actually do care about what another core, or all cores, are doing.
Which core is loading what, which core is storing what, and what is pending and holding up those loads is actually extremely important from the perspective of atomic guarantees.
Weaker CPUs (like, say, POWER7) that do batched writes (you accumulate 256 bits of data, then write it all at once) don't communicate what is in their write-out buffer. So you may write to a pointer, but until that CPU does a batched write, the other cores aren't aware of it. You have to issue a fence to flush this buffer so the write becomes visible to the other cores.
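In C++ terms, a minimal sketch of that fence (on POWER the release fence typically compiles to lwsync; the writer/reader split is illustrative):

```cpp
#include <atomic>

int data = 0;                      // plain, non-atomic payload
std::atomic<bool> ready{false};

void writer() {
    data = 42;                                            // may linger in a write buffer
    std::atomic_thread_fence(std::memory_order_release);  // typically lwsync on POWER
    ready.store(true, std::memory_order_relaxed);
}

void reader() {
    while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    int v = data;  // guaranteed to read 42
    (void)v;
}
```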
There are some scenarios where the same situation can arise on x64, but it's rarer. The Intel cache architecture attempts to negotiate and detect when you are or aren't sharing data between cores. So for _most_ writes it behaves the same way as POWER7, but if it can predict that you're sharing data it'll use a different bus and alert the other CPU directly.
This is why x64 uses a MESIF-esque cache protocol [1][2]. It can tell when data is Owned/Forwarded/Shared between cores.
> Because we're talking about cross-core guarantees of concurrency. So you actually do care about what another core, or all cores, are doing.
This is mostly about the load/store ordering of an individual core: how an individual core decides to order its reads and writes to memory.
> Weaker CPUs (like, say, POWER7) that do batched writes (you accumulate 256 bits of data, then write it all at once) don't communicate what is in their write-out buffer. So you may write to a pointer, but until that CPU does a batched write, the other cores aren't aware of it. You have to issue a fence to flush this buffer so the write becomes visible to the other cores.
You can actually do the same on x86 by using non-temporal stores. Although there you're not talking about store ordering, but about visibility to other cores. A store won't ever be visible to other cores until it at least reaches the L1 cache controller.
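A minimal sketch of the non-temporal case (function and variable names here are illustrative, not from any particular codebase):

```cpp
#include <emmintrin.h>  // SSE/SSE2 intrinsics: _mm_stream_si32, _mm_sfence
#include <atomic>

void publish(int* payload, std::atomic<int>* ready) {
    _mm_stream_si32(payload, 42);  // movnti: goes to a write-combining buffer,
                                   // bypassing the cache, invisible to other cores
    _mm_sfence();                  // drain WC buffers; orders the NT store before the flag
    ready->store(1, std::memory_order_release);
}
```

Without the sfence, even x86's strong ordering doesn't apply: non-temporal stores aren't ordered with the subsequent flag store, so another core could see the flag before the payload.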
> There are some scenarios where the same situation can arise on x64, but it's rarer.
Yup, that's right. That's why x86 (and x64) got mfence and sfence instructions.
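Sketching the fix for the store-buffering litmus test upthread: a full fence between each thread's store and load (a seq_cst fence compiles to mfence on x86):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};

int t1_fenced() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence on x86
    return y.load(std::memory_order_relaxed);
}

int t2_fenced() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence on x86
    return x.load(std::memory_order_relaxed);
}
// With the fences in place, t1_fenced() and t2_fenced() can no longer
// both return 0: the fence drains the store buffer before the load.
```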
> This is why x64 uses a MESIF-esque cache protocol [1][2]. It can tell when data is Owned/Forwarded/Shared between cores.
Reordering happens before the cache controller. By the time the cache controller is involved, the store is already in progress.