
Some notes:

3: I don't think more decoders should be exponentially more complex, or even polynomial; I think O(n log n) should suffice. It just has a hilarious constant factor due to the lookup tables and logic needed, and that log factor also impacts the critical path length, i.e. pipeline length, i.e. mispredict penalty. Of note is that x86's variable-length instructions aren't even particularly good at code size.

Golden Cove (~1y after M1) has 6-wide decode, which is probably reasonably near M1's 8-wide given x86's complex instructions (mainly free single-use loads). [EDIT: actually, no, chipsandcheese's diagram shows it moving only 6 micro-ops per cycle to the reorder buffer, even out of the micro-op cache, despite having 8/cycle retire. Weird.]

6: The count of extensions is a very bad way to measure things; RISC-V will beat everything on that metric in no time, if not already. The main things that matter are ≤SSE4.2 (uses the same instruction encoding as scalar code); AVX1/2 (VEX prefix); and AVX-512 (EVEX). The actual instruction opcodes are shared across those. But three encoding modes (plus the three different lengths of the legacy encoding) is still bad (and APX adds another two on top), and the SSE-to-AVX transition thing is sad.
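For concreteness, here's the same 128-bit float add in all three encoding modes (bytes hand-assembled from the encoding rules, so treat them as illustrative rather than authoritative):

    /* One operation, three x86 SIMD encoding modes: */
    static const unsigned char sse[]  = {0x0F, 0x58, 0xCA};                   /* addps  xmm1, xmm2             */
    static const unsigned char vex[]  = {0xC5, 0xF0, 0x58, 0xCA};             /* vaddps xmm1, xmm1, xmm2       */
    static const unsigned char evex[] = {0x62, 0xF1, 0x74, 0x08, 0x58, 0xCA}; /* vaddps xmm1, xmm1, xmm2 (EVEX) */

Note that the opcode byte 0x58 and the ModRM byte 0xCA are shared across all three; only the prefix scheme differs.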

RISC-V already has two completely separate solutions for SIMD - v (aka RVV, i.e. the interesting scalable one) and p (a simpler thing that works in GPRs; largely not being worked on but there's still some activity). And if one wants to count extensions, there are already a dozen for RVV (never mind its embedded subsets) - Zvfh, Zvfhmin, Zvfbfwma, Zvfbfmin, Zvbb, Zvkb, Zvbc, Zvkg, Zvkned, Zvknhb, Zvknha, Zvksed, Zvksh; though, granted, those work better together than, say, SSE and AVX (but on x86 there's no reason to mix them anyway).

And RVV might get multiple instruction encoding forms too - the current 32-bit one is forced to allow only a single register (v0) as the mask operand due to lack of encoding space, and a potential 48-bit and/or 64-bit instruction encoding extension has been discussed quite a bit.

8: RISC-V RVV can be pretty problematic for some things if compiling without a specific target architecture, as the scalability means that different implementations can have good reason for wildly different relative instruction performance (perhaps most significantly in-register gather (aka shuffle) vs. arithmetic vs. indexed loads from memory).



3. You can look up the papers released in the late 90s on the topic. If it was O(n log n), going bigger than 4 full decoders would be pretty easy.

6. Not all of those SIMD sets are compatible with each other. Some (e.g., SSE4a) wound up casualties of the Intel v. AMD war. It's so bad that the Intel AVX10 proposal is mostly about trying to unify their latest stuff into something more cohesive. If you try to code this stuff by hand, it's an absolute mess.

The P proposal is basically DOA. It could happen, but nobody's interested at this point. Just like the B proposal subsumed a bunch of ridiculously small extensions, I expect a new V proposal to simply unify these. As you point out, there isn't really any conflict between these tiny instruction releases.

There is discussion around the 48-bit format (the bits have been reserved for years now), but there are a couple of different proposals (personally, I think 64-bit only, with the ability to put multiple instructions inside, is better, but that's another topic). Most likely, a 48-bit format does NOT introduce multiple encodings, but instead does a superset of existing encodings (just like how every 16-bit instruction expands into a 32-bit instruction). They need/want 48 bits to allow 4-address instructions too, so I'd imagine it's coming sooner or later.

Either way, the length encoding is easy to work with compared to x86 where you must check half the bits in half the bytes before you can be sure about how long your instruction really is.

8. There could be some variance, but x86 has this issue too and SO many more besides.


The trend seems to be going towards multiple decoder complexes. Recent designs from AMD and Intel do this.

It makes sense to me: if the distance between branches is small, a 10-wide decode may be wasted anyway. Better to decode multiple basic blocks in parallel.


I know the E-cores (gracemont, crestmont, skymont) have the multi-decoder setup; the first couple search results don't show Golden Cove being the same. Do you have some reference for that?

6. Ah yeah the funky SSE4a thing. RISC-V has its own similar but worse thing with RVV0.7.1 / xtheadvector already though, and it can be basically guaranteed that there will be tons of one-off vendor extensions, including vector ones, given that anyone can make such.

8. RVV's vrgather is extremely bad at this, but is very important for a bunch of non-trivial things; existing RVV1.0 hardware has it at O(LMUL^2), e.g. BPI-F3 takes 256 cycles for LMUL=8[1]. But some hypothetical future hardware could do it at O(LMUL) for non-worst-case indices, thus massively changing tradeoffs. So far the compiler approaches are to just not do high LMUL when vrgather is needed (potentially leaving free perf on the table), or using indexed loads (potentially significantly worse).

Whereas x86 and ARM SIMD perf variance is very tiny; basically everything is pretty proportional everywhere, with maybe the exception of very old Atom cores. There'll be some 2x differences up or down in the throughput of certain instruction classes, but it's generally not so bad as to open the way for alternative approaches to be better.

[1]: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.h...


I think you may be correct about gracemont v. golden cove. Rumors/insiders say that Intel has supposedly decided to kill off either the P- or E-core team, so I'd guess that the P-core team is getting laid off, because the E-core IPC is basically the same but the E-core is massively more efficient. Even if the P-core wins, I'd expect them to adopt the 3x3 decoder just as AMD adopted a 2x4 decoder for zen5.

Using a non-frozen spec is at your own risk. There's nothing comparable to stuff like SSE4a or FMA4. The custom extension issue is vastly overstated. Anybody can make extensions, but nobody will use unratified extensions unless you are in a very niche industry. The P extension is a good example here. The current proposal is a copy/paste of a proprietary extension a company is using. There may be people in their niche using their extension, but I don't see people jumping to add support anywhere (outside their own engineers).

There's a LOT to unpack about RVV. Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as packed SIMD, but can sometimes be better, which isn't a terrible place to be.

Differing performance across different performance levels is to be expected when RVV must scale from tiny DSPs up to supercomputers. As you point out, old Atom cores (about the same as the Spacemit CPU) would have a different performance profile from a larger core. Even large AMD cores have different performance characteristics, with their tendency to double-pump AVX2/512 instructions (but not all of them -- just some).

In any case, it's a matter of the wrong configuration, unlike x86 where it is a matter of the wrong instruction (and the wrong configuration at times). It seems obvious to me that the compiler will ultimately need to generate a handful of different code variants (shouldn't be a code bloat issue because only a tiny fraction of all code is SIMD), then dynamically choose the best variant for the processor at runtime.
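A minimal sketch of what that runtime choice could look like; all names here are made up, and the capability probe is a stand-in for whatever the platform actually exposes (e.g. Linux's riscv_hwprobe syscall, or a one-off startup microbenchmark):

    #include <stddef.h>

    /* Hypothetical kernels, one per performance profile: */
    static void kernel_gather (float *dst, const float *src, size_t n) { /* high-LMUL vrgather version */ }
    static void kernel_indexed(float *dst, const float *src, size_t n) { /* indexed-load version */ }

    typedef void (*kernel_fn)(float *, const float *, size_t);

    /* Stand-in for a real capability query. */
    static int fast_vrgather(void) { return 0; /* hypothetical probe */ }

    void transform(float *dst, const float *src, size_t n) {
        /* Resolve once on first call, then reuse; not thread-safe, just a sketch. */
        static kernel_fn chosen;
        if (!chosen) chosen = fast_vrgather() ? kernel_gather : kernel_indexed;
        chosen(dst, src, n);
    }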


> Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as packed SIMD, but can sometimes be better, which isn't a terrible place to be.

Packed SIMD not having LMUL means that hardware can't rely on it being used for high performance; whereas some of the theadvector hardware (and this could equally apply to rvv1.0) already had VLEN=128 with 256-bit ALUs, so LMUL=2 got twice the throughput of LMUL=1. And even above LMUL=2, various benchmarks have shown improvements.

Having a compiler output multiple versions is an interesting idea. Pretty sure it won't happen though; it'd be a rather difficult political mess of more and more "please add special-casing for my hardware", and it would cease to function reasonably on hardware released after the code was compiled (unless something like glibc gets a standard set of hardware performance properties that can be updated independently of precompiled software, which'd be extra hard to get through). P-cores vs E-cores would add an extra layer of mess, too. There might be some simpler version of just going by VLEN, which is always constant, but I don't see much use in that really.


> it's a matter of the wrong configuration, unlike x86 where it is a matter of the wrong instruction

+1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.

I agree with your point that multiple code variants + runtime dispatch are helpful. We do this with Highway in particular for x86. Users only write code once with portable intrinsics, and the mess of instruction selection is taken care of.


> +1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.

What others would you want? Something like vzip1/2 would make sense, but that isn't much of a permutation, since the input elements are exactly next to the output elements.


Going through Highway's set of shuffle ops:

64-bit OddEven/Reverse2/ConcatOdd/ConcatEven, OddEvenBlocks, SwapAdjacentBlocks, 8-bit Reverse, CombineShiftRightBytes, TableLookupBytesOr0 (=PSHUFB) and Broadcast especially for 8-bit, TwoTablesLookupLanes, InsertBlock, InterleaveLower/InterleaveUpper (=vzip1/2).

All of these are considerably more expensive on RVV. SVE has a nice set, despite also being VL-agnostic.


More questionable RVV optimization cases:

- broadcasting a loaded value: a stride-0 load can be used for this, and could be faster than going through a GPR load & vmv.v.x, but could also be much slower (see the sketch after this list).

- reversing: could use vrgather (could do high LMUL everywhere and split it into multiple LMUL=1 vrgathers), or could use a stride of -1 load or store.

- early-exit loops: it's feasible to vectorize these, even with loads, via fault-only-first. But if vl=vlmax is used, it might end up doing a ton of unnecessary computation, esp. on high-VLEN hardware. Though there's the "fun" solution of hardware intentionally lowering vl on fault-only-first to whatever it considers reasonable, as there aren't strict requirements for it.
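As a concrete example of the first bullet, here are both broadcast strategies side by side (assuming the standard v1.0 RVV C intrinsics; which one is faster depends entirely on the microarchitecture):

    #include <stdint.h>
    #include <riscv_vector.h>

    /* Broadcast via GPR: scalar load, then a vmv.v.x splat. */
    vint32m1_t splat_via_gpr(const int32_t *p, size_t vl) {
        return __riscv_vmv_v_x_i32m1(*p, vl);
    }

    /* Broadcast via stride-0 load: every element reads the same address. */
    vint32m1_t splat_via_stride0(const int32_t *p, size_t vl) {
        return __riscv_vlse32_v_i32m1(p, 0, vl);
    }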


Expanding on 3: I think it ends up at O(n^2 * log n) transistors, O(log n) critical path (not sure about routing or what fan-out issues there might be).

Basically: determine the end of an instruction at each byte (trivial but expensive). Then determine the end of two instructions at each byte via end2[i]=end[end[i]]. Then end4[i]=end2[end2[i]], etc., log n times.
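A software model of that doubling step (purely illustrative; hardware would implement each round as parallel muxes rather than loops):

    #include <stddef.h>

    #define W 32  /* decode window size in bytes */

    /* end[i] = index one past the instruction starting at byte i,
       clamped to W. After round k, cur[i] skips 2^k instructions;
       log2(W) rounds resolve every chain in the window. */
    void chase_ends(const unsigned char end[W], unsigned char out[W]) {
        unsigned char cur[W], nxt[W];
        for (size_t i = 0; i < W; i++) cur[i] = end[i];
        for (size_t step = 1; step < W; step *= 2) {
            for (size_t i = 0; i < W; i++)
                nxt[i] = cur[i] < W ? cur[cur[i]] : W;  /* end2[i] = end[end[i]] */
            for (size_t i = 0; i < W; i++) cur[i] = nxt[i];
        }
        for (size_t i = 0; i < W; i++) out[i] = cur[i];
    }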

That's essentially log(n) shuffles. With 32-byte/cycle decode that's roughly five 'vpermb ymm's, which is rather expensive (though various forms of shortcuts should exist - for the larger layers direct chasing is probably feasible, and for the smaller ones some special-casing of single-byte instructions could work).

And, actually, given the mention of O(log n)-transistor shuffles at http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo..., it might even just be O(n * log^2(n)) transistors.

Importantly, x86 itself plays no part in the non-trivial part. It applies equally to the RISC-V compressed extension, just with a smaller constant.


Determining the end of a RISC-V instruction requires checking two bits, and you have the guarantee that no instruction exceeds 4 bytes or uses fewer than 2 bytes.
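In code, the whole RISC-V length determination (ignoring the reserved longer encodings) is just:

    #include <stdint.h>

    /* Low two bits of the first halfword are 0b11 for a 32-bit
       instruction; anything else is a 16-bit compressed instruction. */
    static inline int rv_insn_len(uint16_t first_halfword) {
        return (first_halfword & 0x3) == 0x3 ? 4 : 2;
    }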

x86 requires checking for a REX, REX2, VEX, EVEX, etc. prefix. Then you must check for one, two, or three opcode bytes. Then you must check for the existence of a register (ModRM) byte, how many immediate bytes there are, and whether a scaled-index (SIB) byte is used. Then, if a register byte exists, you must check it for any displacement bytes to get your final instruction length.

RISC-V starts with a small complexity then multiplies it by a small amount. x86 starts with a high complexity then multiplies it by a big amount. The real world difference here is large.

As I pointed out elsewhere ARM's A715 dropped support for aarch32 (which is still far easier to decode than x86) and cut decoder size by 75% while increasing raw decoder count by 20%. The decoder penalties of bad ISA design extend beyond finding instruction boundaries.


I don't disagree that the real-world difference is massive; that much is pretty clear. I'm just pointing out that, as far as I can tell, it's all just a question of a constant factor; it's just that the factor is massive. I've written half of a basic x86 decoder in regular imperative code, handling just the baseline general-purpose legacy encoding instructions (it determines length correctly, and determines opcode & operand values to some extent), and even that was already a lot.



