“csinc”, the AArch64 instruction you didn’t know you wanted (danlark.org)
226 points by jandeboevrie on June 7, 2023 | 96 comments


I discovered a really cool ARM64 trick today. One thing about x86 that I've found useful on so many occasions is the PCMPEQB + PMOVMSKB + BSF trick that lets me scan the bytes of a string 10x faster. I couldn't find any information on Google for doing PMOVMSKB with ARM, so I've been studying ARM's "Optimized Routines" codebase where I stumbled upon the answer in their strnlen() implementation. It turns out the trick is to use `shrn dst.8b, src.8h, 4` which turns a 128-bit mask into a 64-bit mask. You can then get the string offset index with fmov, rbit, clz and finally shift by 2.
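
Roughly, the same idea in NEON intrinsics looks like this (just a sketch, not the actual Optimized Routines code; the helper names are made up, and __builtin_ctzll stands in for the rbit+clz pair):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Turn a byte-wise compare result (0x00/0xFF per lane) into a 64-bit
       mask with 4 bits per input byte, via shrn. */
    static inline uint64_t neon_nibble_mask(uint8x16_t eq) {
        uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
        return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
    }

    /* Index of the first byte equal to c in a 16-byte chunk, or 16 if none. */
    static inline int first_match(const uint8_t *p, uint8_t c) {
        uint64_t m = neon_nibble_mask(vceqq_u8(vld1q_u8(p), vdupq_n_u8(c)));
        return m ? (__builtin_ctzll(m) >> 2) : 16;  /* each byte owns 4 mask bits */
    }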


I am the author of this trick as well

You can read about it in https://community.arm.com/arm-community-blogs/b/infrastructu...


Wow. I love your work. Thank you for coming here and talking about it. You could write Hacker's Delight 2nd edition for a new generation.


Hacker's Delight already has a 2nd edition.

https://www.oreilly.com/library/view/hackers-delight-second/...


would totally love to read a modern `Hacker's Delight`. My mind was so blown away the first time I learned about low-level optimizations. I wish I did more of that on a day to day


Let's add it to ClickHouse: https://github.com/ClickHouse/ClickHouse/blob/master/base/ba...

It should significantly improve the performance on ARM.


The VSHRN trick is nice (I used it only two hours ago!), but it really does feel like a crutch; I don't understand why they couldn't simply implement a PMOVMSKB-like instruction to begin with (it cannot possibly be very expensive in silicon, at least not if the result went into a vector register). One-bit-per-byte is really the sweet spot for almost any kind of text manipulation, and often requires less setup/post-fixup on either side of the PMOVMSKB/VSHRN.


> However, developers often encounter problems with Arm NEON instructions being expensive to move to scalar code and back.

I remember talking to an ARM engineer easily 10 years ago and he told us in that nice British accent: "You know, NEON is like 'back in the yard'" :-D. This has changed a lot, but not enough from what you wrote... Bit sad that these SIMD optimizations are still hand-written...


I found the following article about the topic really good: https://branchfree.org/2019/04/01/fitting-my-head-through-th...

In my experience using a 512-wide movemask (to uint64_t) is the fastest on both x86 and arm64. (Edit: just to clarify, I meant the fastest for iteration; things like SwissMap are better off using a 128-wide movemask)

With rvv you don't really want to go from a vector mask to a general purpose non vector register, because the vector length may vary. But I found it really useful that vector masks are always packed into v0. So even with LMUL=8, you can just do a vmseq, switch to LMUL=1 and use vfirst & vmsif & vmandn to iterate through all indices. (Alternatively vfirst & vmsof & vmclr would also work, I'm not sure which one would be faster)


I am very surprised that this is presented as something new. From the very beginning of ARM, all instructions have had a condition attached to them. Contrary to the article, it has absolutely nothing to do with making the processor more CISCy, but is instead one of its most RISCy aspects.


All 32-bit ARM opcodes had predication, but when ARM went 64-bit, they wanted to recover the encoding space for 32 instead of 16 registers, and removed predication from most instructions. When they did this, they looked at all the 32-bit ARM binaries they could find, and counted which instructions were actually used with predicates, and added the top 5 of those as separate instructions.


Arguably the two most important features of RISC, for the modern era, were having regular instruction sizes leading to easier parallel decode and having at most one memory access per instruction making the combination of pipelining and precise exceptions much less of a pile of worms.

ARM's 32 bit ISA was very regular and it mostly had a single memory access per instruction, but there were some instructions, like store multiple, which could potentially save every register to memory. By getting rid of that in A64 and replacing it with an instruction that stores a pair of registers with a single memory access, they ended up in a far RISCier place than A32.
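
For reference, that pair store is A64's stp; a minimal example (register choice arbitrary):

    stp x19, x20, [sp, #-16]!   // store two registers with one memory access,
                                // pre-decrementing sp (a typical prologue push)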


Interesting - is there a reference for this?


Yes, I had similar thoughts when I started reading, but I think only ARM32 has predication. (There's a prefix-instruction-based something or other in Thumb, I think, but it doesn't devote part of the encoding space to predication bits like ARM32 does.)

As I understand it they didn't carry predication across from ARM32 to ARM64 for various performance reasons (if you want to be able to re-order instructions, or even aggressively pipeline them, you don't want them depending on the result of the immediately-prior instruction).

Predication everywhere (i.e. orthogonal to the rest of the instruction set, and not special-cased) is certainly more RISC than CISC - but having removed it in general, bringing it back for a few specific instructions is arguably CISCy.


> There's a prefix-instruction-based something or other in Thumb, I think, but it doesn't devote part of the encoding space to predication bits like ARM32 does.

Yes, the IT (if-then) instruction (prefix). It is not supported by Cortex-M0, Cortex-M0+, and Cortex-M1, though. Those are the smallest T32 (Thumb-2) microcontroller designs ARM has.

IT can be followed by up to 4 instructions and encodes the predicate bits they would have had in 32-bit ARM code (A32). There is not total freedom regarding their predicate bits: they all have to share the same ground condition (3 bits) and then get an individual bit that says whether to execute when that condition is met or when it isn't. The IT instruction is a 16-bit instruction that devotes 8 bits to this -- not 7, because the encoding is weird.
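
For example (a small sketch in unified Thumb-2 syntax):

    cmp   r0, #0
    ite   eq          @ if-then-else: next insn runs if EQ, the one after if NE
    moveq r1, #1      @ executed when r0 == 0
    movne r1, #2      @ executed when r0 != 0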


I should add that M0/M0+/M1 have very short pipelines: 2 and 3 stages. That means the cost of a branch isn't all that high so the benefit of predication is small.

(They don't have branch predictors, either.)


I thought this was interesting, although of course I agree with many commenters' take that the lack of reference to the "old-school" ARM where everything was conditional is odd.

I got curious about how RISC-V handles this, but only curious enough to find [1] and not dig any further. That answer is from a year ago, so perhaps there have been changes.

[1]: https://stackoverflow.com/a/72341794/28169


"cmov" and several more interesting instructions in the draft RISC-V Bitmanip proposal were dropped before it reached 1.0 though.

There is a new proposal: Zicond, but it is quite crude, with two instructions. The "czero.eqz" instruction does:

  rd = (rs2 == 0) ? 0 : rs1;
And the other, "czero.nez", tests for "rs2 != 0". Both are supposed to produce an operand for another instruction, where a zero operand makes it a nop: for conditional add, sub, xor, etc. Conditional move, however, takes three instructions: two results, exactly one of which is zero, which get or'ed together.
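
So a full conditional move ends up as something like this (a sketch; registers arbitrary, with the condition assumed to be a nonzero/zero value in t0):

    czero.eqz t1, a0, t0    # t1 = (t0 == 0) ? 0 : a0
    czero.nez t2, a1, t0    # t2 = (t0 != 0) ? 0 : a1
    or        a2, t1, t2    # a2 = t0 ? a0 : a1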

https://github.com/riscv/riscv-zicond/blob/main/zicondops.ad...

Otherwise, the intention was that bigger RISC-V cores would detect a conditional branch over a single instruction in the decoder and perform macro-op fusion into a conditional instruction.


> Otherwise, the intention was that bigger RISC-V cores would detect a conditional branch over a single instruction in the decoder and perform macro-op fusion into a conditional instruction.

This seems like an overhead compared to actually having the instruction available. Could anyone say how material an overhead this is?


As far as I know, there is already hardware that implements this. [0]

> […] It is because of a special feature of the U74 that when it sees a short forward branch over exactly one ALU instruction it pairs the two instructions together in the A and B pipelines and instead of predicting whether the branch in the A pipe is taken or not it uses the result of the comparison to predicate the instruction in the B pipe.

> It turns it into a NOP at the last moment, or doesn't write the result back to the destination register or something like that.

Also note that the compressed relative branch instructions only use 16 bytes to encode.

[0] https://www.reddit.com/r/RISCV/comments/132s19s/hand_optimis...


> Also note that the compressed relative branch instructions only use 16 bytes to encode.

Bits


It's obviously possible and people have done far more complicated things. It's all a question of how much you're willing to spend on the front end in terms of transistors and engineer-hours.


Not quite cmov, but Alibaba's T-Head extensions have mveqz (move if equal zero) and mvnez (move if not equal zero).


Before reading the article, my former DSP engineer brain kicked in and thought: "complex cardinal sine (sinc), why would you want that?"

https://en.wikipedia.org/wiki/Sinc_function


The while loop in the third paragraph is easier to read in assembly than in the original C++, which either says something about how well chosen the instruction set is, or about how bad some of C++ is.


Nothing to do with C++ - it's plain C code as a matter of fact, but that's not important at all. What the code does is employ low-level, intrinsic knowledge of the CPU microarchitecture (x86-64) and the compiler's codegen ability (clang) to pack as many instructions per cycle as possible, so that the resulting (de)compression speed is improved. You cannot write such a piece of code so that it looks "beautiful" to an average Joe.


Right, but the use of bitwise AND, and the repeated conditional expressions are the kind of weirdness I’d expect a good compiler to not need.

I’ve worked a lot on the kernel and I’m no stranger to optimized code. This is still really weirdly written, and in fact the assembly is much more readable, which is funny.

I know clang needs a lot of prodding to output good code (compared to gcc), but I’m curious whether even clang really needs the logic to be so warped.


It’s weirdly written, maybe to mimic conditional machine instructions. It’s also unusual in that it seems to assume that each input array contains each number only once, as it outputs numbers contained in both input arrays only once, but only under that prior assumption.


I love seeing this instruction pop up in disassembly. I've seen it come up when growing a dynamic array, with some C code like...

    if (is_pow2_or_zero(len)) {
        int grown = len ? len*2 : 1; 
        ptr = realloc(ptr, (size_t)grown * sizeof *ptr);
    }
compiling into this sort of disassembly to calculate the value of grown:

    lsl    w8, w19, #1      // w8 = len*2
    cmp    w19, #0x0        // is len zero?
    csinc  w8, w8, wzr, ne  // w8 = (w8 if len != 0) or (0+1 if len == 0)
Pretty clever to create that 1 constant using csinc on the wzr zero register.


Though it'd be preferable to do:

    cmp wzr, w19      // set the carry flag if w19 is zero
    adc w8, w19, w19  // w8 = w19 + w19 + carry


Parent is 12 bytes, yours is 8, but x86 can do it in 5:

    add eax, eax
    setz al


Nope. If eax is initially 1, then it will be 0 after your sequence, where 2 was desired.


My mistake. I was thinking of the other idiom for !. How about this then:

      add eax, eax
      jnz skipinc
      inc eax
    skipinc:
Also 5 bytes.


Nice!

Sorry if this comment is overly pedantic, I just enjoy having an excuse to talk about assembly.

It's worth noting that 0x80000000 would pass this "is zero" check. (I think this is probably a legal compiler optimisation because signed integer overflow is undefined, but I'm not 100% sure either way.)

Using a jump is also a bit risky - slightly better if it's predictable, much worse if it's unpredictable.

As far as size, this is 5 bytes on 32-bit x86 (as stated), 6 bytes on 64-bit x86, but can be 8 bytes if different registers are used:

    4501C0            add r8d,r8d
    7503              jnz 0x8
    41FFC0            inc r8d
(And, unlike the ARM code, you'd need an additional mov instruction if you wanted to preserve the input value.)

It feels like an ADC-based variant might be possible on x86 too - CMP and ADC are also x86 instructions. The problem is that ARM and x86 invert the value of the carry flag on subtraction (and comparison), so it doesn't translate directly, and I can't immediately see how to fix it up without using more instructions.


Wouldn't this be the ideal instruction for implementing multi-word arithmetic? If the carry flag is set from the previous (lower order) addition, increase the next word up by one and continue adding.

And of course ARM 32 had conditional execution for all instructions. These appear to be the variants that were useful enough to keep around when the general feature was removed in AArch64.


ARM has both add-with-carry and add-without-carry instructions, so a separate increment is not necessary. (I don't know much about AArch64, only ancient ARM2/3, but I expect they left this in.)


Yes AArch64 has "adds" for modifying the carry flag after the first addition and then "adcs" for using and modifying the carry flag in subsequent additions.
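
E.g. a 128-bit add across two register pairs (a sketch, register choice arbitrary):

    adds x4, x0, x2    // low 64 bits: x4 = x0 + x2, sets carry
    adcs x5, x1, x3    // high 64 bits: x5 = x1 + x3 + carry (adc if no further limbs)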


ARM used to have the beautiful UMAAL, a single instruction that would multiply two registers, then accumulate two other values into the result, then store it as a double word into the registers.

This is the inner loop of multiplication and was very nice to use, but died in the AArch64 transition.


You have to be careful with turning control dependencies into data dependencies. It can be very hard to understand or predict how a CPU will behave.

If you are testing quite predictable things, you almost always want to use branch prediction and not predicated/conditional instructions.

If something is totally unpredictable, let's say a binary search that is looking up random elements in a well balanced heap or tree. Each comparison is very unpredictable. A conditional select would work best there:

    item = (val < item->val ? item->left : item->right);
    if (val == item->val) ...
You could do your tree walks entirely without branch misses if that first line was a select... But it turns out that is not true. Or it's not necessarily true; depending on a few (not uncommon) factors, it can be worse to use a select there.
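
For instance, a descent loop built on that select might look like this (a sketch; node layout assumed):

    struct node { int val; struct node *left, *right; };

    struct node *find(struct node *item, int val) {
        while (item && item->val != val) {
            // can compile to csel on AArch64 / cmov on x86 instead of a branch
            item = (val < item->val) ? item->left : item->right;
        }
        return item;
    }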


How does software these days target all the different CPUs with different instructions?

If I download, say, debian-11.7.0-amd64-netinst.iso - does it somehow dynamically adapt to all the different AMD and Intel CPUs and uses the instructions available on the users machine?


Software compiled to be "portable" uses a reduced subset. You actually have to bully GCC into using the full CPU instruction set with -march=native (you can also put another target CPU arch there).

In short, distributed binaries tend to use "least common denominator" instructions.

I believe one of the pros touted of Gentoo, where everything is compiled locally, is that all the software uses the CPU to its fullest potential.


You can also dispatch at runtime based on CPUID (x86) or getauxval(AT_HWCAP) for ARM. Also Clang and GCC seem to be moving in the direction of removing -march=native which is sad since grokking all the different microarchitectures isn't easy.
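
A minimal sketch of the ARM/Linux flavour (assuming glibc on AArch64 Linux; the HWCAP_ASIMD value is the one from <asm/hwcap.h>):

    #include <sys/auxv.h>
    #include <stdio.h>

    #ifndef HWCAP_ASIMD
    #define HWCAP_ASIMD (1UL << 1)   /* AArch64 Linux: Advanced SIMD (NEON) */
    #endif

    int main(void) {
        unsigned long hwcap = getauxval(AT_HWCAP);
        puts(hwcap & HWCAP_ASIMD ? "dispatch to the NEON routine"
                                 : "fall back to scalar code");
    }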


Lots of speed sensitive programs also ship multiple implementations they can choose at run time so they can more fully utilize a CPU without recompiling.


Probably one of the biggest bangs for your buck (on Linux) would be to recompile your libc and OpenSSL with -march=native. Then at least all software that depends on those libs (probably the majority) would get some benefit from your local processor's extensions.


glibc and OpenSSL both have runtime detection of CPU features to switch between hand-coded versions of hot functions, so those two are probably the packages that will get the _least_ benefit from a recompilation.


Or you can use Gentoo.


Afaik, Debian runs on the 386? And I think that came out in the 80s?

So all new CPU instructions of the last 40 years are pretty much used by nobody?


No, according to this [1], Debian's i386 architecture dropped support for the 386 and 486 in versions Sarge and Squeeze respectively. Pretty amazing that it did work on the first Pentium (released 1993!) up to 2018. In Debian versions after Jessie, the Pentium (i586) has been dropped too [2].

Of course, I don't think many people are using i386 builds anymore - most people would have switched to the more modern x86_64 long ago.

1. https://www.debian.org/releases/jessie/i386/ch02s01.html.en

2. https://www.debian.org/releases/stretch/i386/ch02s01.html.en


The dirty secret is: yes, many of the new instructions aren't used very often. They only come into play in certain cases: encryption, signal processing, CPU graphics (e.g. paint packages), video/jpeg decoding (when not done on GPU) and so on, most of which are packaged inside libraries which may or may not have multi-CPU implementations.


I think the default configuration for GCC x64 is SSE 4.0 instructions enabled, AVX2 instructions enabled, AVX512 and SSE 4.2 disabled. And I think MSVC defaults to processors of a roughly similar age (10 to 15 years old). So instructions in the 40-to-15 year old range get used a lot.

If you're running software that's heavily math intensive, or ever gets benchmarked, it's a fairly good bet that they will have either conditionally-installed or conditional-executed code that targets more modern processors.


The baseline amd64 System V ABI, which is what Debian targets, only includes SSE and SSE2; not SSE3, SSE4 or AVX2.


The new instructions are used often or seldom depending on whether you count invocations or occurrences.

There are new instructions that can be used to write a constant-time SHA-256 hash function. A program that contains megabytes of code, more than a million instructions, may reasonably contain only a few of those instructions, because it contains only one or two small hash functions. The instructions are important and effective, but occur only a handful of times among a million instructions.


The Linux kernel dropped support for the original 80386 in December 2012, with kernel 3.8

Debian dropped support with sarge in 2005

https://en.wikipedia.org/wiki/I386


IFunc relocations are how glibc dynamically chooses the best memcpy routine to use at runtime based on the CPU.

see https://github.com/bminor/glibc/blob/glibc-2.31/sysdeps/x86_...


Here's an article that describes how to use function multi-versioning and indirect functions (ifunc) with GCC and GNU tools:

https://lwn.net/Articles/691932/
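
In practice it can be as small as this sketch (using GCC's target_clones attribute; the function is just an illustration, not taken from the article):

    // The compiler emits one clone per listed target plus an ifunc resolver
    // that picks the best clone at load time based on the running CPU.
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    int sum(const int *v, int n) {
        int s = 0;
        for (int i = 0; i < n; i++) s += v[i];
        return s;
    }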


In many cases, multiple implementations are included and one is chosen to utilize the best instruction supported by the CPU. Example code: https://source.chromium.org/chromium/chromium/src/+/main:thi...


There are several uarch levels defined for x86_64 which include newer instructions than the baseline. Some distros are starting to move to use those higher levels, notably RHEL9 is x86_64-v2.

You'll find lots of discussions happening around this topic, for example: https://www.phoronix.com/news/Arch-Linux-x86-64-v3-Port-RFC



For a while, submissions to the iOS app store could include bitcode, which was LLVM's intermediate byte code. I don't know if they ever did, but Apple could generate architecture-optimized binaries for their various CPU models. They deprecated that last year, though.

.Net ahead-of-time compilation (that is, compiling the .net / clr VM byte code into something your CPU can run directly) could (but apparently doesn't?) do CPU-specific optimizations. The JIT compiler, however, does do some CPU-specific optimizations.


Compiler flags. You turn on/off compiler optimisations for target architectures that are aware of all the instruction-set specific hardware level optimisations.


But I'm not compiling. And neither are 99.9% of other software users.


Hence why JITs have some advantages when shipping software, at the cost of a few extra MBs.


Only if the JIT uses the fancy instructions :)


Most JVM implementations and the CLR do keep up with fancy instructions.

Not all, but surely a few.


I am not 100% convinced this will perform as well on every armv8 implementation. Have you tried this on first-gen v8 cores such as the A53?

I think that is the reason GCC will not use it, although it may if you set the target CPU with -mcpu=


Conditional moves tend to work even better on small in-order designs than later OoOE cores.


My assumption is that larger designs mean longer pipelines, and that increases the penalty of mispredicted conditional jumps. On a 3 or 5-stage pipeline things are not as bad.


Sort of. The bigger issue is that in larger OoOE cores, the predicate ends up being a speculation barrier if you don't predict it. So you need hardware similar to a branch predictor guessing those predicates. At that point it's easier to just point people towards conditional jumps where you ostensibly have very good branch prediction if you have any sort of predictability and spend your gate budget there where it's more likely to be more generally used.


It looks like the reason this apparently weird instruction exists is that AArch64 has a zero register, meaning you can use csinc with two zero register operands to represent cond ? 1 : 0.

Given that AArch64 has/had no 16-bit instruction support, it probably made sense to provide a generalization of a setcond instruction to make use of the encoding space of 32-bit instructions, and that's one of the most obvious (the other ones being cond ? imm : 0 or cond ? imm : reg).
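
Concretely (a sketch; registers arbitrary):

    cmp   x0, x1
    cset  x2, eq           // alias of csinc x2, xzr, xzr, ne: x2 = (x0 == x1) ? 1 : 0
    csinc x3, x4, x5, lt   // general form: x3 = (x0 < x1) ? x4 : x5 + 1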


Side Note:

10/10 on the website. Clean simple design and doesn't download 4,124 javascript libraries for the purpose of displaying static content.


I wonder how long it will take for all the software to mature enough to fully use the performance of today's hardware. I mean all the optimizations in language compilers, OSes and such. 50 years? 1 year after the first AGI coder?


Look at the demoscene. They're still exploring the limits of the C64 (1MHz 6502, 64k RAM).


Although ARM is marketed as RISC, it does have a lot of CISC-like features. I suspect the designers knew that with fixed-size instructions, they had to pack as much as they could into them to increase code density.


What would be CISC-like is if the opcode operated on memory locations, such that the CPU would have to deal with it taking a page fault.

Anyway, here's John Mashey, who helped design the MIPS, on RISC v CISC:

https://yarchive.net/comp/risc_definition.html


Awesome post, TIL about that instruction. I just found myself wanting a `csinc` instruction when optimizing a function to merge sorted lists.

Looking forward to your future posts!


Too bad, I thought it was about computing the complex sine function.


So, a very useful and versatile instruction. Glad AArch64 got it.


ARM was supposed to be RISC, but this sounds like BISC - baroque instruction set computer.


I think it is very much in the RISC philosophy to have fewer, more powerful, but still simple, instructions which can be combined with operands in complex ways to do a lot of different things.

Another example of this are all the combinations with the hard-coded zero register. For instance, the `cmp` "instruction" in A64 (and many other RISC ISAs) is actually an alias for the `subs` (subtract and set status flags) instruction with the zero register as destination. The idea of the zero register was so potent that modern CISC x86 processors actually have a physical zero register internally, which olde x86 instructions are translated to use.
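
I.e. (a sketch of the alias):

    cmp  x0, x1        // assembler alias for:
    subs xzr, x0, x1   // subtract, set NZCV, discard the result into the zero register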


You've no idea.

The ARM has lots of instructions, each fairly simple. Compare this to an architecture where a single instruction can ① compute the address of its operands in main memory, ② read them, ③ carry out its main operation and eventually ④ write the result to main memory, with most of those steps optional and depending on the arguments supplied.


CSINC and even CMOVBL do not sound "fairly simple".

Maybe it is because of the mental model of a higher-level programmer - for an application programmer, anything that involves writing and reading main memory directly is considered simple, whereas combining a conditional, an increment and a write in a single op sounds "not simple".


Keep in mind that when the CISC vs RISC distinction was coined, the predominant architectures were things like IBM System/370 and VAX. System/370 has an instruction that inserts an item into a heap. (and modern Z/Architecture is worse: there are instructions like "compute the HMAC of this data" where the MAC algorithm is in a register.) VAX had an instruction that evaluated a polynomial.

The world is very different now. RISC won, and it won hard. x86 got a lot more registers, and most of the baroque instructions are no longer used because they got relegated to microcode. As it's used, it's much closer to Berkeley RISC than it is to VAX or S/370. ARM is even more so: the only instructions that touch memory are loads and stores.

As a rough approximation, "simple" instructions can be implemented in a reasonable amount of silicon without microcode, and complex instructions can't.


The thing that I find hardest to get used to on ARM is watching compilers generate four or five or six register-based instructions to implement an addressing mode that could have been executed in a single instruction on x86. It really breaks my mental model of how C/C++ translates into machine code.


With modern compilers, the mental model of what code translates to is long dead. Operations are reordered and statically scheduled, code can be inlined, outlined, eliminated, or duplicated. Vectorization is a whole other can of worms that can appear in often surprising ways. Mix in security extensions like pointer authentication, branch target identification, or even pure software things like stack cookies and now you're getting code you never even wrote inserted everywhere.

The best approach is always just to verify the disassembly.


I don't really have any problems with mental models for the code I'm going to see on x64 (which does incorporate models of instruction scheduling, multiple issue, elimination (pretty easy), and vectorization, &c).

Maybe it will come with more time on ARM; but three years in, it's still not there.


Well all modern architectures have FP and SIMD so it’s not as though complex (microcoded) instructions have been banished.


SIMD instructions are generally not microcoded though because they are not actually complex. A SIMD addition has pretty much the same microarchitectural complexity as a scalar addition.

If you're looking for truly complex instructions, you should look for things like VMENTER or IRET.


Most SIMD and FP instructions are not microcoded in a modern mainstream CPU, FWIW.


@klelatti As far as I know, nobody has SIMD trig functions at the instruction level. FP trig functions are definitely micro-coded, and are (as-far-as-I-know) found only on CISC processors.

(HN seems to have a limit on how deeply you can nest replies. I can't reply to @klelatti's post directly :-/ )


Interesting. I sort of expected that for basic FP / SIMD but not for the full range - eg trig functions. Is that wrong?


There really aren't complex math functions in most SIMD ISAs(1), these functions are implemented in software instead. Even for scalar operations, no one(2) uses the x87 trig functions either, as software implementations are both faster and more accurate and have been for a couple decades.

(1) various HPC folks have proposed and used extensions that do _part_ of a complex math operation in SIMD at various times, and on GPUs this sort of thing is very common.

(2) except for math libraries that haven't been updated for a couple decades.


It really is very simple from a hardware standpoint. You have a little box in the execution unit with two regular inputs and one single bit input and one regular output. Once you have adds with carry you already have to have that structure of inputs in place, which has implications throughout your scheduler, but adding CSINC at that point is almost trivial.

Complicated things might be something like division, which can take multiple cycles during which that functional block is busy, or a floating point addition where there are all sorts of complicated rules involving implicit global state around rounding and subnormals and NaNs. Or, most terrifying of all, a load which might target memory that's been paged out and so require you to bring all the scores of instructions in flight to a halt, switch over to OS code, page in the memory, and then resume as if that one load instruction was the only thing executing at the time the exception happened.


Modern CPUs often keep a hundred instructions in flight, and accessing main memory once can take more time than a hundred simple instructions. Accessing main memory involves cache coherency protocol logic with neighbouring cores, and it may involve locking if the program wants a read barrier, a write barrier or any kind of volatile variable.

Instructions like CMOVBL involve only a small number of CPU registers, nothing else, and can't interact with instructions far ahead or behind them in the instruction stream, or with other cores/threads at all. Very little state. They're simple to reason about, both for the compiler authors, the CPU and the poor developer who's chasing a threading bug.


FWIW, the original ARM CPU from 1985 (which was a very 'pure' RISC implementation) already could execute each instruction conditionally, the condition bit mask was just part of the regular opcode structure:

https://en.wikichip.org/wiki/arm/armv1


It's not a write, it's just a destination register.

There isn't supposed to be some absolute number of instructions that a RISC has; the idea behind it is that you take a quantitative approach to adding instructions, requiring that they show benefit beyond a composition of other instructions and can be practically used.

More capable compilers, wider use of vectorization and other techniques, and more transistors has pushed that a long way since the 1980s.


It's simple, in part, because it only operates on registers and immediate values, so there's no chance of it taking a page fault. That makes it easy to pipeline, and, compared to having to save a lot of ALU state after a fault, the operation this opcode performs is, indeed, fairly simple.



