In ML workloads, the FP16 operations are usually matrix operations. On RDNA3 these execute at the same rate as normal shader/vector operations, but Nvidia RTX cards have Tensor cores that accelerate them. The Ada whitepaper lists 48.7 shader TFlops (48.7 rather than 43 because it quotes boost rather than base clock), and 195 TFlops for FP16 Tensor with FP16 accumulate. That's 4x the regular shader rate, and almost double what the XTX lists!
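For reference, here's the back-of-the-envelope math behind those numbers, assuming the 4080's commonly quoted specs (9728 shader ALUs, 2.205/2.505 GHz base/boost clocks, 2 FLOPs per FMA, and a 4x-per-SM dense tensor rate for FP16 with FP16 accumulate):

```python
alus = 9728                      # RTX 4080 shader ALUs ("CUDA cores"), assumed
base_ghz, boost_ghz = 2.205, 2.505
fma = 2                          # one fused multiply-add counts as 2 FLOPs

print(alus * fma * base_ghz  / 1e3)      # ~42.9 shader TFLOPS at base clock
print(alus * fma * boost_ghz / 1e3)      # ~48.7 shader TFLOPS at boost clock
print(alus * fma * boost_ghz * 4 / 1e3)  # ~194.9 dense tensor FP16 TFLOPS (FP16 accumulate)
```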
Ampere and newer also have native sparsity support, which lets the hardware skip over the zeroes 'for free'; Nvidia uses that to market double the TFlops, which is kind of misleading imo. But the 195 TFlops figure is before sparsity is even included!
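For context, the 'free' sparsity is the 2:4 structured kind: in every group of 4 weights, at most 2 may be nonzero, so the sparse tensor cores only do half the MACs, which is where the doubled marketing figure comes from. A toy illustration of the pattern (plain numpy, nothing hardware-specific):

```python
import numpy as np

w = np.random.randn(4, 16).astype(np.float16)
groups = w.reshape(-1, 4)                          # every 4 consecutive weights form a group
keep = np.argsort(np.abs(groups), axis=1)[:, 2:]   # keep the 2 largest-magnitude per group
mask = np.zeros_like(groups, dtype=bool)
np.put_along_axis(mask, keep, True, axis=1)
w_24 = (groups * mask).reshape(w.shape)            # 2:4 pattern: half the values are zero
# the hardware stores only the kept values plus small indices and skips the zero MACs
```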
I'm not sure whether the 93 TFlops (120 with boost clocks) on AMD are with FP16 or FP32 accumulation; with FP32 accumulation the 4080 slows down significantly, to 97.5 TFlops, and gets much closer.
Intel Xe-HPG (used in the Arc A cards) also offers very aggressive matrix acceleration via XMX, with 137.6 FP16 TFlops at base clock, vs. 17.2 FP32 TFlops.
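Those Arc numbers also fall straight out of the unit counts, assuming the A770's 512 XMX engines (64 FP16 MACs each per clock), 4096 FP32 ALUs, and a ~2.1 GHz clock:

```python
xmx_engines, fp32_alus, ghz = 512, 4096, 2.1     # assumed Arc A770 figures
print(xmx_engines * 64 * 2 * ghz / 1e3)  # ~137.6 FP16 TFLOPS via XMX
print(fp32_alus * 2 * ghz / 1e3)         # ~17.2 FP32 TFLOPS on the vector ALUs
```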
You’re comparing general-purpose computations to a proprietary feature with limited applicability. For example, in my Mistral implementation the most expensive compute shader handles matrices compressed with a custom codec that tensor cores know nothing about.
This is not an apples-to-apples comparison. We use FLOPS numbers to compare processors of different architectures because they’re the same FLOPS everywhere.
Neither is FLOPS. RDNA3 FLOPS numbers are greatly inflated, because they only apply to fairly specific VOPD and WMMA instructions: the former dual-issues two MADs at the same time, and the latter is applicable in the same cases as tensor cores.
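To put numbers on 'inflated', assuming the 7900 XTX's 6144 ALUs at the ~2.5 GHz boost clock: the headline peaks only materialize when the compiler actually emits VOPD pairs and/or WMMA; plain single-issue code sees a fraction of the advertised figure.

```python
alus, ghz = 6144, 2.5                # assumed 7900 XTX shader ALUs and boost clock
print(alus * 2 * ghz / 1e3)          # ~30.7 FP32 TFLOPS, plain single-issue FMA
print(alus * 2 * 2 * ghz / 1e3)      # ~61.4 FP32 TFLOPS, only if VOPD dual-issues
print(alus * 2 * 2 * 2 * ghz / 1e3)  # ~122.9 FP16 TFLOPS, needs packed FP16 or WMMA on top
```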
Besides, it should be possible to use the tensor cores combined with the codec: you can use vector hardware to decode matrix values to fp16/fp32, then yeet those into the tensor cores. Most of the time will probably be spent on the decoding part, though, assuming you're doing matrix-vector multiplication and not matrix-matrix (which might be different with MoE models like Mixtral 8x7B?).
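A minimal sketch of that split, using PyTorch instead of compute shaders and a made-up 4-bit-plus-scale codec standing in for the actual (unknown) one: the dequantize step is ordinary elementwise vector-ALU work, and the fp16 matmul afterwards is the part the tensor cores can pick up (for matrix-matrix shapes; a pure matrix-vector product really would be dominated by the decode, as above).

```python
import torch

def dequantize(codes, scales):
    # codes: [rows, cols//2] uint8, two 4-bit values per byte; scales: [rows] fp16
    lo = (codes & 0x0F).to(torch.float16) - 8
    hi = (codes >> 4).to(torch.float16) - 8
    w = torch.stack((lo, hi), dim=-1).reshape(codes.shape[0], -1)
    return w * scales[:, None]     # all plain elementwise / vector-ALU work

rows, cols, batch = 4096, 4096, 16  # toy sizes, assumes a CUDA-capable GPU
codes  = torch.randint(0, 256, (rows, cols // 2), dtype=torch.uint8, device="cuda")
scales = torch.rand(rows, dtype=torch.float16, device="cuda")
x = torch.randn(cols, batch, dtype=torch.float16, device="cuda")

w = dequantize(codes, scales)  # step 1: decode with the vector hardware
y = w @ x                      # step 2: fp16 GEMM, eligible for the tensor cores
```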