Yes, having worked on one of the out-of-order Intel CPUs, I can tell you that you are correct. Instructions may be "complete", in that their results can be forwarded to later operations, but an instruction isn't "retired" until it is known that it cannot raise an exception, be cancelled by a branch mispredict, etc. Programmer-visible architectural state as defined in the ISA is not written until instruction retirement. The CPU re-ordering instructions is not going to change semantics (on x86 and similar architectures... there are some archs that relax that guarantee).
Compilers are notorious for doing dumb things around locks... GNU C for the AVR architecture, for instance, looks at the SEI instruction (SEt global Interrupt enable bit), notices that it doesn't modify memory or registers, and hoists it to the top of the function. Eh... no, SEI; CLI; <code> <critical section> <code> is not what I intended...
Also... CPUs with data caches can do smart things with architecturally-defined locking instructions such as "test-and-set" or "compare-and-exchange", so that those instructions are always cache-coherent across CPUs. If you try to roll your own locking code, you had best understand how the cache-invalidation mechanism works on your chosen CPU, or you are going to have a bad day.
> CPUs with data caches can do smart things with architecturally-defined locking instructions such as "test-and-set" or "compare-and-exchange", so that those instructions are always cache-coherent across CPUs. If you try to roll your own locking code, you had best understand how the cache-invalidation mechanism works on your chosen CPU, or you are going to have a bad day.
What do you mean? Are you implying that read-modify-writes are treated differently from plain writes by the cache coherency protocol?
I'm saying that an atomic RMW is going to get the cache line in "exclusive" state (in a typical MESI protocol), but that if you are trying to gin up the equivalent with spin locks you need to think through how that plays out, as the reads might be in "shared" state.
I still don't see what you're getting at. What is the implication of this for software? The implementation of the cache coherency protocol is largely opaque to software.