
No matter how many times I read about these, I'm always left just slightly confused. Acquire/release is about the ordering of instructions within a thread of execution. So if I have a writer writing X, then writing Y, then I need write-release to make sure that the compiler actually puts the instructions for Y after the instructions for X in the machine code. If I want to guarantee that the results of those writes are visible to another thread, then I need a memory fence to force flushing of the caches out to main memory basically?

The author mentions the queue as originally written is technically correct for x86/x86_64. This is true only in the sense that neither the producer nor the consumer can experience partial reads or partial writes, right? It is still possible that if, say, the consumer were busy-waiting for an item to become available, it could spin for as long as it takes the CPU to decide to flush the producer's writes out to main memory?

Whenever I see these concepts discussed, it is in the context of the C++ stdatomic library. If I were writing a program in C99, I would assume it would still be possible to communicate the same intent / restrictions to the compiler, but I'm not sure because I haven't been able to find any resources that discuss doing so. How might one communicate that to the compiler, assuming they are on x86/x86_64 where in theory the CPU should just do the right thing with the right machine code?

Finally, does target architecture influence the compiler's behavior in this regard at all? For example, if we take x86/x86_64 as having acquire/release semantics without any further work, does telling the compiler that my target architecture is x86/x86_64 imply that those semantics should be used throughout the program?




I just wanted to clarify something about flushing caches: fences do not flush the caches in any way. Inside the CPU there is a data structure called the load store queue. It keeps track of pending loads and stores, of which there could be many. This is done so that the processor can run ahead and request things from the caches, or have things populated into the caches, without having to stop dead the moment it has to wait for any one access. The memory fencing influences how entries in the load store queue are allowed to provide values to the rest of the CPU execution units. On weakly ordered processors like ARM, the load store queue is allowed to forward values to the execution pipelines as soon as they are available from the caches, except if a store and a load are to the same address. x86 only allows values to go from loads to the pipeline in program order. It can start operations early, but if it detects that a store comes in for a load that's not the oldest, it has to throw away the work done based on the speculated load.

Stores are a little special in that the CPU can declare a store as complete without actually writing data to the cache system. So the stores go into a store buffer while the target cache line is still being acquired. Loads have to check the store buffer. On x86 the store buffer releases values to the cache in order, and on ARM the store buffer drains in any order. However, both architectures allow loads to read values from the store buffer without them being in the cache and without the normal load queue ordering. They also allow loads to different addresses to occur before stores. So on x86 a store followed by a load can execute as the load first, then the store.
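To make that concrete, here is a sketch of the classic store-buffering litmus test in C11 (my own names, not from the article): both threads can end up reading 0, even on x86, because each store may still be sitting in its core's store buffer when the other core's load executes.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int x, y;     // both start at 0
    int r1, r2;

    void *thread_a(void *arg) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);  // store to x...
        r1 = atomic_load_explicit(&y, memory_order_relaxed); // ...then load y
        return NULL;
    }

    void *thread_b(void *arg) {
        atomic_store_explicit(&y, 1, memory_order_relaxed);  // store to y...
        r2 = atomic_load_explicit(&x, memory_order_relaxed); // ...then load x
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        // r1 == 0 && r2 == 0 is a legal outcome: each load can execute
        // before the other core's store has left its store buffer.
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }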

Fences logically force the store buffer to flush and the load queue to resolve values from the cache. So everything before the fence reaches the caching subsystem, where standard coherency ensures it is visible when requested. Then new operations start filling the load store queue, but they are known to be later than the operations before the fence.
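Continuing the sketch above: a full fence between each store and the following load logically drains the store buffer first, so the 0/0 outcome becomes impossible.

    void *thread_a_fenced(void *arg) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst); // store drains before the load below
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }
    // With the same fence in the other thread, r1 == 0 && r2 == 0 is forbidden.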


That clarifies fences for me a bit more. Thanks for the insight.


If you program without fences, instructions are reordered by the compiler and/or the processor as they see fit, provided the single-threaded semantics are unchanged. This is why alias analysis is a big deal. A store can be moved before a load if they are independent, i.e. do not alias. A store followed by a load can be simplified to use the value already in a register if there is no other store in the meantime.
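A minimal illustration of that single-threaded freedom (hypothetical names): with a plain variable, the compiler may hoist the load out of the loop entirely, turning a busy-wait into an infinite loop.

    int flag;   // plain, non-atomic shared variable

    void wait_for_flag(void) {
        // No store to flag is visible here, so the compiler may load it
        // once into a register and compile this as: if (!flag) for (;;);
        while (!flag) { }
    }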

This doesn't work if there are multiple threads and shared mutable state. Whatever semantics the programmer had in mind, and encoded in load/store patterns, are usually insufficient for correctness under arbitrary interleaving of threads.

This is fixed by introducing additional constraints on which instructions can be reordered. Fences affect loads, stores, or both, usually with respect to all memory but potentially only a subset of it. The point is to say that moving some operation past some other one would cause unacceptable behaviour for this program, so neither the compiler nor the CPU shall do so.

On top of this there's the C++ memory order model, where you can tag an integer add with acq_rel semantics, or specify fences, all reasoned about in terms of 'synchronises-with' relations and a global oracle determining acceptable execution sequences. I think this is grossly overcomplicated and heavily obfuscates the programming model to no gain. Fortunately one can mechanically desugar it into fences and reason with the result.
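As one example of that desugaring (a sketch, not the only legal mapping): a release fence followed by a relaxed store is at least as strong as a release store, so you can reason about the fence form instead.

    #include <stdatomic.h>

    atomic_int ready;

    void publish_sugared(void) {
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    void publish_desugared(void) {  // at least as strong as the above
        atomic_thread_fence(memory_order_release);               // fence first
        atomic_store_explicit(&ready, 1, memory_order_relaxed);  // then the store
    }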


Would it be correct to say that acquire-release is in some sense "higher level" than memory fences, with acq-rel implemented in terms of fences (and restrictions on code-gen?) and fences being all that the CPU actually knows about at least in the case of x86_64?


Acquire/release is higher level because it is more of a theoretical description than a hardware description. But acq/rel is not implemented using fences on x86. On this arch all loads and stores have implicit acquire and release semantics respectively, while all RMWs are sequentially consistent acq+rel. The compiler will need to emit explicit fences only very rarely on x86.
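Concretely, the standard C11-to-x86-64 mapping (to the best of my knowledge; check your compiler's output): acquire loads and release stores are plain movs, and only seq_cst stores and RMWs need anything stronger.

    #include <stdatomic.h>

    atomic_int v;

    int  load_acq(void)   { return atomic_load_explicit(&v, memory_order_acquire); } // x86: mov
    void store_rel(int n) { atomic_store_explicit(&v, n, memory_order_release); }    // x86: mov
    void store_sc(int n)  { atomic_store_explicit(&v, n, memory_order_seq_cst); }    // x86: xchg (or mov + mfence)
    int  inc_rmw(void)    { return atomic_fetch_add(&v, 1); }                        // x86: lock xadd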


Okay, that makes sense to me. Thank you.


> If I were writing a program in C99, I would assume it would still be possible to communicate the same intent / restrictions to the compiler, but I'm not sure because I haven't been able to find any resources that discuss doing so

You cannot. See Boehm, 'Threads cannot be implemented as a library' (https://web.archive.org/web/20240118063106if_/https://www.hp...). You can do it in C11, however, which includes functionally the same facilities as C++11 (howbeit C-flavoured, obviously). https://en.cppreference.com/w/c/atomic
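A minimal taste of the C-flavoured interface (my own toy example):

    #include <stdatomic.h>

    atomic_int counter;   // _Atomic int, zero-initialised

    void hit(void) {
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    }

    int snapshot(void) {
        return atomic_load_explicit(&counter, memory_order_acquire);
    }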

> does target architecture influence the compiler's behavior in this regard at all? For example, if we take x86/x86_64 as having acquire/release semantics without any further work, does telling the compiler that my target architecture is x86/x86_64 imply that those semantics should be used throughout the program?

It does not imply that. You should completely ignore the target architecture. C11 provides an abstract interface and a set of abstract constraints for concurrent programs; you should program against that interface, ensuring that your code is correct under those constraints. The compiler is responsible for making sure that the constraints are satisfied on whatever target you happen to run on (so your code will be portable!).

> if I have a writer writing X, then writing Y, then I need write-release to make sure that the compiler actually puts the instructions for Y after the instructions for X in the machine code

You need Y to be a write-release if you would like it to be the case that, if another thread acquire-reads and observes Y, then it will also observe X. (The classic example is 'message passing', where X is a message and Y is a flag saying that there is a message. Obviously, it would be bad if you could see the flag but not the actual message.) But maybe you don't need that property.
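That message-passing pattern, as a hypothetical C11 sketch:

    #include <stdatomic.h>

    int msg;          // X: the message, plain data
    atomic_int flag;  // Y: "there is a message", initially 0

    void producer(void) {
        msg = 42;                                               // write X
        atomic_store_explicit(&flag, 1, memory_order_release);  // write-release Y
    }

    int consumer(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;         // acquire-read Y until it is observed
        return msg;   // guaranteed to be 42: the acquire synchronised with the release
    }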

> If I want to guarantee that the results of those writes are visible to another thread, then I need a memory fence to force flushing of the caches out to main memory basically?

No. That's not what a fence does and that's not how caches work.


Clear. Thank you for the threads-as-a-library link; very interesting.


And yet people did threaded programming in C before C11. Granted, you cannot do it in plain C99 -- in practice extensions were used. The hard part wasn't getting the fence instructions (asm will do it) but getting the compiler to not re-order things around fences, and `asm volatile ("" ::: "memory")` (or similar) would do that.
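A sketch of that pre-C11 idiom, GCC-flavoured and x86-only (illustrative, not portable; the first macro is the same trick as the `asm volatile` above, spelled strictly for C99):

    /* Compiler barrier: no memory access may be moved across this point. */
    #define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")

    /* Full hardware fence on x86, which is also a compiler barrier. */
    #define FULL_FENCE() __asm__ __volatile__ ("mfence" ::: "memory")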


This is essentially what I was getting at. Thank you.


> Acquire/release is about the ordering of instructions within a thread of execution

Acquire/release is about visibility. Acquire and release always go in pairs. You can't really reason about a release or an acquire in isolation, and that's why simply thinking about instruction reordering is not enough.

> So if I have a writer writing X, then writing Y, then I need write-release to make sure that the compiler actually puts the instructions for Y after the instructions for X in the machine code. If I want to guarantee that the results of those writes are visible to another thread, then I need a memory fence to force flushing of the caches out to main memory basically

Whether an explicit memory fence is needed or not depends on the architecture (for example, you do not need them on x86). But you do not need to care: if you use the atomic operations with the correct semantics, the compiler will insert any required fences for you.
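For instance (the standard mappings, as far as I know; compilers may differ), the same C11 release store costs different things per target:

    #include <stdatomic.h>

    atomic_int f;

    void publish(void) {
        atomic_store_explicit(&f, 1, memory_order_release);
        // x86-64:  mov           (release is free, an ordinary store)
        // AArch64: stlr          (dedicated store-release instruction)
        // ARMv7:   dmb ish; str  (explicit fence before a plain store)
    }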

As an aside, fences typically have nothing to do with caches. Once a store or a load operation hits the cache, the coherence system takes care that everything works correctly. If fences had to flush the cache, they would be orders of magnitude slower.

Instead, fences (explicit or otherwise) make sure either that memory operations commit (i.e. become visible at the cache layer) in the expected order, or that the application can't tell otherwise; reordering is still permitted across fences as long as conflicts can be detected and repaired, which in practice is only possible for loads, since they can be retried without side effects.

> Whenever I see these concepts discussed, it is in the context of the C++ stdatomic library. If I were writing a program in C99, I would assume it would still be possible to communicate the same intent / restrictions to the compiler

Formally, multithreaded programs are UB in C99. Of course other standards (POSIX, OpenMP) and implementations (the old GCC __sync builtins) could give additional guarantees; but only C11 gave a model defined well enough to reason in depth about the overall CPU+compiler system; before that people just had to make a lot of assumptions.
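For example, the legacy GCC builtins looked like this (all of them imply a full barrier, which is part of why C11's finer-grained model was welcome):

    int old_count;                           /* plain int; pre-C11 style */

    void hit_legacy(void) {
        __sync_fetch_and_add(&old_count, 1); /* atomic add, full barrier */
    }

    void fence_legacy(void) {
        __sync_synchronize();                /* full memory barrier */
    }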

> Finally, does target architecture influence the compiler's behavior in this regard at all? For example, if we take x86/x86_64 as having acquire/release semantics without any further work, does telling the compiler that my target architecture is x86/x86_64 imply that those semantics should be used throughout the program?

It does, but note that the compiler will only respect acquire/release semantics for operations on atomic objects with the required ordering, not for normal loads and stores.
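A hypothetical illustration of that caveat: even when targeting x86, the compiler only owes ordering around operations on atomic objects, never between plain accesses.

    #include <stdatomic.h>

    int a, b;         // plain objects
    atomic_int done;  // atomic object

    void writer(void) {
        a = 1;        // the compiler may reorder or merge these two
        b = 2;        // plain stores however it likes...
        atomic_store_explicit(&done, 1, memory_order_release);
        // ...but it may not sink either of them below this release store.
    }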





