> it's quite common that floating point ops throwaway bits (i.e. beyond the epsilon) will be numerically different between vendors.
I think these implementation-specific shenanigans only applied to legacy x87 instructions, which use 80-bit registers. SSE and AVX instructions operate on 32- or 64-bit floats. The math there, including rounding behavior, is well specified by the IEEE 754 standard.
It abuses AES instructions to generate a long pseudo-random sequence of bits, then reinterprets those bits as floats, does some FP math on them, and saves the output.
I’ve tested addition, multiplication, FMA, and float-to-integer conversions. All four output files, 1 GB each, are bitwise identical between an AMD desktop and an Intel laptop.
> It abuses AES instructions to generate a long pseudo-random sequence of bits, then reinterprets those bits as floats, does some FP math on them, and saves the output.
Nice work! I'd be extremely curious to see if this still holds on intrinsics like `_mm512_exp2a23_round_ps` or `_mm512_rsqrt14_ps` (I'd wager it probably won't).
I don’t have AVX-512-capable hardware in this house. Lately, I mostly develop for desktop and embedded. For both use cases AVX-512 is safe to ignore, as its market penetration is next to none.
I’ve made a small test app: https://github.com/Const-me/SimdPrecisionTest/blob/master/ma...