In ML workloads, the FP16 operations are usually matrix operations. On RDNA3 these execute at the same rate as normal shader/vector operations, but Nvidia RTX cards have Tensor cores that accelerate them. The Ada whitepaper lists 48.7 shader TFlops (48.7 rather than 43 because it quotes boost rather than base clock), and 195 TFlops for FP16 Tensor with FP16 accumulate. That's 4x the regular shader rate, and almost double what the XTX lists!
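For reference, here's the back-of-the-envelope math behind those numbers, assuming the 4080's commonly quoted specs (9728 shader ALUs, 2.205/2.505 GHz base/boost clocks, 2 FLOPs per FMA, and a 4x-per-SM dense tensor rate for FP16 with FP16 accumulate):

```python
alus = 9728                      # RTX 4080 shader ALUs ("CUDA cores"), assumed
base_ghz, boost_ghz = 2.205, 2.505
fma = 2                          # one fused multiply-add counts as 2 FLOPs

print(alus * fma * base_ghz  / 1e3)      # ~42.9 shader TFLOPS at base clock
print(alus * fma * boost_ghz / 1e3)      # ~48.7 shader TFLOPS at boost clock
print(alus * fma * boost_ghz * 4 / 1e3)  # ~194.9 dense tensor FP16 TFLOPS (FP16 accumulate)
```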
Ampere and newer also have native sparsity support, which lets the hardware skip over the zeroes 'for free'; Nvidia uses that to market double the TFlops, which is kind of misleading imo. But the 195 TFlops figure is before sparsity is even included!
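For context, the 'free' sparsity is the 2:4 structured kind: in every group of 4 weights, at most 2 may be nonzero, so the sparse tensor cores only do half the MACs, which is where the doubled marketing figure comes from. A toy illustration of the pattern (plain numpy, nothing hardware-specific):

```python
import numpy as np

w = np.random.randn(4, 16).astype(np.float16)
groups = w.reshape(-1, 4)                          # every 4 consecutive weights form a group
keep = np.argsort(np.abs(groups), axis=1)[:, 2:]   # keep the 2 largest-magnitude per group
mask = np.zeros_like(groups, dtype=bool)
np.put_along_axis(mask, keep, True, axis=1)
w_24 = (groups * mask).reshape(w.shape)            # 2:4 pattern: half the values are zero
# the hardware stores only the kept values plus small indices and skips the zero MACs
```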
I'm not sure whether the 93 TFlops (120 with boost clocks) on AMD are with FP16 or FP32 accumulation; with FP32 accumulation the 4080 slows down significantly, to 97.5 TFlops, and gets much closer.
Intel Xe-HPG (used in the Arc A cards) also offers very aggressive matrix acceleration via XMX, with 137.6 FP16 TFlops at base clock, vs. 17.2 FP32 TFlops.
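Those Arc numbers also fall straight out of the unit counts, assuming the A770's 512 XMX engines (64 FP16 MACs each per clock), 4096 FP32 ALUs, and a ~2.1 GHz clock:

```python
xmx_engines, fp32_alus, ghz = 512, 4096, 2.1     # assumed Arc A770 figures
print(xmx_engines * 64 * 2 * ghz / 1e3)  # ~137.6 FP16 TFLOPS via XMX
print(fp32_alus * 2 * ghz / 1e3)         # ~17.2 FP32 TFLOPS on the vector ALUs
```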
You’re comparing general-purpose computations to a proprietary feature with limited applicability. For example, in my Mistral implementation the most expensive compute shader handles matrices compressed with a custom codec that tensor cores know nothing about.
This is not an apples-to-apples comparison. We use FLOPS numbers to compare processors of different architectures because they’re the same FLOPS everywhere.
Neither is FLOPS. RDNA3 FLOPS numbers are greatly inflated, because they only apply to fairly specific VOPD and WMMA instructions: the former dual-issues two MADs at the same time, and the latter is applicable in the same cases as tensor cores.
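To put numbers on 'inflated', assuming the 7900 XTX's 6144 ALUs at the ~2.5 GHz boost clock: the headline peaks only materialize when the compiler actually emits VOPD pairs and/or WMMA; plain single-issue code sees a fraction of the advertised figure.

```python
alus, ghz = 6144, 2.5                # assumed 7900 XTX shader ALUs and boost clock
print(alus * 2 * ghz / 1e3)          # ~30.7 FP32 TFLOPS, plain single-issue FMA
print(alus * 2 * 2 * ghz / 1e3)      # ~61.4 FP32 TFLOPS, only if VOPD dual-issues
print(alus * 2 * 2 * 2 * ghz / 1e3)  # ~122.9 FP16 TFLOPS, needs packed FP16 or WMMA on top
```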
Besides, it should be possible to use the tensor cores combined with the codec: you can use vector hardware to decode matrix values to fp16/fp32, then yeet those into the tensor cores. Most of the time will probably be spent on the decoding part, though, assuming you're doing matrix-vector multiplication and not matrix-matrix (which might be different with MoE models like Mixtral 8x7B?).
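A minimal sketch of that split, using PyTorch instead of compute shaders and a made-up 4-bit-plus-scale codec standing in for the actual (unknown) one: the dequantize step is ordinary elementwise vector-ALU work, and the fp16 matmul afterwards is the part the tensor cores can pick up (for matrix-matrix shapes; a pure matrix-vector product really would be dominated by the decode, as above).

```python
import torch

def dequantize(codes, scales):
    # codes: [rows, cols//2] uint8, two 4-bit values per byte; scales: [rows] fp16
    lo = (codes & 0x0F).to(torch.float16) - 8
    hi = (codes >> 4).to(torch.float16) - 8
    w = torch.stack((lo, hi), dim=-1).reshape(codes.shape[0], -1)
    return w * scales[:, None]     # all plain elementwise / vector-ALU work

rows, cols, batch = 4096, 4096, 16  # toy sizes, assumes a CUDA-capable GPU
codes  = torch.randint(0, 256, (rows, cols // 2), dtype=torch.uint8, device="cuda")
scales = torch.rand(rows, dtype=torch.float16, device="cuda")
x = torch.randn(cols, batch, dtype=torch.float16, device="cuda")

w = dequantize(codes, scales)  # step 1: decode with the vector hardware
y = w @ x                      # step 2: fp16 GEMM, eligible for the tensor cores
```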