Why RISC-V uses LR/SC instead of CAS for atomic memory operations? (twitter.com/hasheddan)
19 points by hasheddan on June 29, 2021 | 9 comments


RISC-V is such a breath of fresh air compared to other ISAs. It's well thought out, elegant, and effectively solves issues other archs have. It kind of reminds me of Rust in that way, actually.

More to the point, the LR/SC instructions are an implementation of transactional memory, but only for single words. This is pretty exciting, since transactional memory is a possible way to make writing concurrent programs faster (and make the programs themselves faster).
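
To make the "word-sized transaction" idea concrete, here's a minimal C11 sketch (mine, not from the linked thread) of the usual read-modify-write retry loop. On an LL/SC machine like RISC-V, the weak compare-exchange inside the loop can be lowered to a single LR/SC pair, which is why the weak variant is allowed to fail spuriously.

  #include <stdatomic.h>

  /* Sketch: add 'delta' to *p only if the current value is below 'limit'.
     The do/while retries when another thread changed *p in the meantime;
     on RISC-V the compare_exchange_weak typically becomes an LR/SC pair. */
  int add_if_below(atomic_int *p, int delta, int limit) {
      int old = atomic_load_explicit(p, memory_order_relaxed);
      do {
          if (old >= limit)
              return 0;                       /* abort the "transaction" */
      } while (!atomic_compare_exchange_weak(p, &old, old + delta));
      return 1;                               /* committed */
  }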


LR/SC (aka LL/SC) was popular in CPUs in the '90s. It turned out to be a performance bottleneck, and was abandoned in modern designs. It is as if they knew enough to leave behind the branch delay slot, but didn't get the memo about other trouble spots.

This is another example of what seems like tone-deaf design right at the innermost core of RISC-V, and why I think we need an incompatible Risc-6 that fixes the core booboos but keeps enough else that the correctness proofs are easy to port over.

Another example is that the microarchitecture can't tell if it is doing an ordinary indirect branch or a function return. While these operations seem the same to a coder, they are very, very different from implementation performance and even security standpoints, thus deserving separate opcodes.

Somewhere there is a published list including these and others, that would be a good starting document.

The whole business of not using condition codes is another example. Condition codes were momentarily a problem for out-of-order execution, until the condition code register got shadowed like the others, and now they are no problem. But RISC-V's whole ISA is rejiggered around eliminating condition codes, uselessly.

Faithful readers here know about my horror that popcount is missing from the core instruction set, so that code generators can't count on having it available for its 10x performance boost of remarkably common operations.


I'm very curious what you consider to be "modern designs". The only important ISAs designed in the 21st century are ARMv8 and RISC-V.

Both use LL/SC (called Load-Exclusive/Store-Exclusive in ARM).

RISC-V from the outset also specified Atomic Memory Operations such as AMOADD, AMOMIN, AMOMAX, AMOOR, AMOAND, AMOXOR. Implementations are permitted to trap the AMOs and implement them using LL/SC if they wish. (The reverse isn't feasible). ARM added similar operations (and CAS, CASP) only in FEAT_LSE (Large Systems Extensions) in ARMv8.1-A, and only in A64.
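
For what it's worth, those AMOs line up naturally with C11's atomic fetch-and-op functions. A small illustrative sketch (the exact instruction selection is of course compiler- and memory-ordering-dependent):

  #include <stdatomic.h>
  #include <stdint.h>

  _Atomic uint64_t counter;
  _Atomic uint64_t flags;

  void record(uint64_t mask) {
      /* With the A extension these compile to single AMO instructions
         rather than LL/SC loops: roughly amoadd.d and amoor.d on RV64. */
      atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
      atomic_fetch_or_explicit(&flags, mask, memory_order_relaxed);
  }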

I don't know what IA64 did, but it's dead anyway, so who cares?

The next newest important ISA is POWER/PowerPC, which uses LL/SC.

> Another example is that the microarchitecture can't tell if it is doing an ordinary indirect branch or a function return.

In fact it can and, except for a couple of minimal Cortex-M0 competitors, all RISC-V SoCs I know of, right back to the now ancient FE-310 and FU-540, have return address prediction which works very effectively and does not use resources in the indirect branch predictor.

> Condition codes were momentarily a problem for out-of-order execution

That's one opinion. Another opinion, which both Intel and ARM seem to share, is that "compare and branch" should have been one instruction, and both of their recent designs fuse a compare and a following conditional branch into a single uOp.

Using the same condition code result to control multiple branches is a rare thing, not worth optimising for.

The really amusing thing here is that I'm pretty sure you criticise RISC-V's idea of implementing "missing" instructions or addressing modes by fusing simpler instructions.

> Faithful readers here know about my horror that popcount is missing from the core instruction set

Neither the ARMv8 integer instruction set nor NEON has popcount.

ARM SVE and RISC-V V both have popcount on execution masks (only)

> code generators can't count on having it available for its 10x performance boost of remarkably common operations.

Single-cycle popcount is very very expensive. Recent Intel CPUs have a 3 cycle latency for popcount.

You can do a popcount using simple operations ... shift, and, add ... in 17 instructions, or 12 instructions including one multiply. On a superscalar machine with the typical 4 cycle latency for multiply, that version has a total latency of 14 clock cycles.
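
For reference, this is the well-known shift/and/add-plus-one-multiply sequence being described, sketched in C (exact instruction counts vary a little by ISA and compiler):

  #include <stdint.h>

  /* SWAR popcount: sum bits within 2-, 4-, then 8-bit groups, then use one
     multiply to add the eight byte sums into the top byte. */
  static inline unsigned popcount64(uint64_t x) {
      x = x - ((x >> 1) & 0x5555555555555555ULL);
      x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
      x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
      return (unsigned)((x * 0x0101010101010101ULL) >> 56);
  }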

The built-in POPCNT instruction is therefore only 14/3 ≈ 4.7 times faster than the sequence of simple instructions. A similar ratio applies to throughput if you are doing several in parallel.

There is no way that any actual algorithm with popcount as a component has that 4.7:1 ratio. There will be other instructions. The ratio will be lower.

10x is simply impossible.

I'm curious what you think these remarkably common operations are.


Surely someone who knows so much understands the difference between 3-cycle pipelined latency, where you can have 3 results in 3 cycles, and 3-cycle execution, where only one result is available every 3rd cycle. Thus, comparing 14 cycles pipelined to 3-cycle latency is disingenuous at best.

The facts lead me to disbelieve that you are, in fact, curious.


Of course I understand it.

And as that 12 instruction popcount implementation is almost totally serial, on a typical 3 or 4 wide modern OoO x86 CPU you can do 3 or 4 popcount calculations in those 14 clock cycles (or maybe 16 or something), so the ratio remains basically the same no matter whether you're talking about latency or throughput. As I SAID in my previous post.


Even if they don't livelock, retries can be significantly slower than a direct atomic operation under contention. Compare https://www.cs.tau.ac.il/~mad/publications/ppopp2013-x86queu...

Programming under high cache line contention is like message passing on a really busy network with many nodes, and anything that cuts down round trips can make a big difference in scalability. Most people who do network programming know these lessons by heart, but it's still poorly understood by people doing shared memory.

So maybe it's simpler, but likely it's slower too.


I'm not sure what it is you're arguing is "maybe simpler, but likely slower too". LL/SC?

The paper you're referring to isn't about LL/SC at all. It's about CAS vs fetch-and-add (called AMOADD on RISC-V), showing that F&A scales better than C&S.

LL/SC is referred to only in that some ISAs (including ARM and RISC-V) use LL/SC to implement C&S. But then it goes on to look at x86 exclusively.
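
The effect the paper measures is easy to sketch: the same shared counter bumped with a single fetch-and-add versus a CAS retry loop. Under contention every failed CAS is a wasted round trip on the cache line (illustrative C11, not code from the paper):

  #include <stdatomic.h>

  _Atomic long ticket;

  /* One atomic round trip per increment, even under contention
     (amoadd.d on RISC-V, LOCK XADD on x86). */
  long take_ticket_faa(void) {
      return atomic_fetch_add(&ticket, 1);
  }

  /* The same increment via CAS: each failed attempt reloads and retries,
     so throughput degrades as more threads fight over the line. */
  long take_ticket_cas(void) {
      long old = atomic_load(&ticket);
      while (!atomic_compare_exchange_weak(&ticket, &old, old + 1))
          ;  /* 'old' was refreshed by the failed CAS; try again */
      return old;
  }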

Not only the instructions matter here but also implementation.

For example, the way that LL/SC can livelock is that the LL needs to cause that cache line to be acquired for writing, flushing it from other CPUs' caches. If another CPU does the same to you before you get to the SC then you lose out and have to loop and try again.

RISC-V provides a "forward progress guarantee" if there are 16 or fewer instructions between the LL and SC, they are "simple" instructions, and they are all in the same cache line.

One simple way to implement this is to delay responding to a cache eviction request for up to 16 clock cycles if you have a LL on that cache line. Then the only way the SC fails (for properly written code) is if you take an interrupt.

The design and implementation of the Atomic Memory Operations is also interesting. RISC-V was co-designed with the TileLink bus. TileLink comes in three flavours:

1) TL-UL Uncached Lightweight: simple get/put transactions

2) TL-UH Uncached Heavyweight: adds support for atomic operations

3) TL-C Cached: adds support for coherent cache blocks

With TL-C, atomic memory operations don't have to be executed by fetching the data all the way to the CPU, doing the operation, and writing it back. With, for example, an AMOADD (atomic add), both the address and the data to be added are sent over the TL-C bus and the add is executed in a shared L2 or L3 cache, or even potentially (it's up to the system designer) directly in the L1 cache of another CPU. Only the result of the add is sent back. The latency is potentially the same as a simple memory read.

TL-C is obviously used inside an SoC, but can also be used over a link such as FMC (FPGA Mezzanine Card), PCIe, or 400G Ethernet (e.g. Western Digital's "OmniXtend").

TL-UH would typically be implemented by peripherals so they can implement AMOs directly on their registers.

TileLink:

https://sifive.cdn.prismic.io/sifive%2Fcab05224-2df1-4af8-ad...

Other buses implement similar features, but this is the one I'm more familiar with.


If, like me, you have no idea what's going on here:

LR/SC stands for load-reserved/store-conditional, also called load-linked/store-conditional.

In a traditional atomic implementation using Compare-and-Swap, the order of execution is as follows:

1. Read value X into register A.

2. Do computation using register A, creating a new value in register B.

3. Do a compare-and-swap on value X: if X == A, then set X to B and the operation was successful. If X != A, another thread changed X while we were using it, so the operation failed. Roll back and retry.

This suffers from the ABA problem: it does not detect the case where another thread changes X to a new value C, but then changes it back to A before the compare-and-swap happens.

An example:

  def push(stack, val):
    1. new = malloc()
    2. new->data = val
    3. orig = atomic_load(&stack->head)
    4. new->next = orig
    5. compare_and_swap(&stack->head, new, orig) // todo retry on CAS failure
  
  def pop(stack):
    1. orig = atomic_load(&stack->head)
    2. new = orig->next
    3. compare_and_swap(&stack->head, new, orig) // again, retry skipped for brevity
    4. val = orig->data
    5. free(orig)
    6. return val
This allows for the following order of execution:

  Our stack contains 3 items at 0x01, 0x02, 0x03.
  Thread 1 calls pop

  1. orig = atomic_load(&stack->head)          // orig = 0x01
  2. new = orig->next                          // new = 0x02

  Context switch. Thread 2 calls pop twice.

  1. orig = atomic_load(&stack->head)          // orig = 0x01
  2. new = orig->next                          // new = 0x02
  3. compare_and_swap(&stack->head, new, orig) // stack->head = 0x02
  5. free(orig)                                // 0x01 is freed

  1. orig = atomic_load(&stack->head)          // orig = 0x02
  2. new = orig->next                          // new = 0x03
  3. compare_and_swap(&stack->head, new, orig) // stack->head = 0x03
  5. free(orig)                                // 0x02 is freed

  Context switch. Thread 3 calls push.

  1. new = malloc()                            // new = 0x01. Valid, it was just freed.
  3. orig = atomic_load(&stack->head)          // orig = 0x03
  4. new->next = orig                          // new->next = 0x03
  5. compare_and_swap(&stack->head, new, orig) // stack->head = 0x01

  Context switch. Thread 1 continues pop

  3. compare_and_swap(&stack->head, new, orig) // stack->head = 0x02
  5. free(orig)                                // 0x01 is freed
Boom. stack->head now points to 0x02, which has already been freed!

LR/SC avoids this by having "lr" reserve a memory location. On calling "sc", if any write to that location occurred since the "lr", the store will fail - even if the content when "sc" is called is identical to the content when "lr" was called.

Please feel free to correct me if I got anything wrong - this is what I found when trying to understand it myself!


There's a neat solution to overcome the ABA problem - tagged (versioned) pointers. Define the pointer as a pair of a version counter and a raw pointer, and every time a new value is set for the raw pointer part, increment the version.
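
A sketch of that idea in C11, reusing the stack from the example above. Whether the double-width compare-and-swap is actually lock-free depends on the hardware (e.g. cmpxchg16b on x86-64); check with atomic_is_lock_free.

  #include <stdatomic.h>
  #include <stdint.h>

  struct node { struct node *next; int data; };

  /* Pointer plus version counter: even if the raw pointer is reused (ABA),
     the version will differ, so the CAS on the pair fails as it should. */
  struct tagged_ptr {
      struct node *ptr;
      uintptr_t    version;
  };

  _Atomic struct tagged_ptr head;

  void push(struct node *n) {
      struct tagged_ptr old = atomic_load(&head), new;
      do {
          n->next     = old.ptr;
          new.ptr     = n;
          new.version = old.version + 1;   /* bump on every update */
      } while (!atomic_compare_exchange_weak(&head, &old, new));
  }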



