BTW, there’s much less reason to use these approximations on modern hardware.
The divps instruction (32-bit float divide, the precise version) was relatively slow back then: on AMD K8 it had a latency and throughput of 18-30 cycles, AMD Jaguar took 21 cycles, and Core 2 Duo did it in 6-18 cycles.
Fortunately, they fixed the CPUs: Skylake has an 11-cycle latency for that instruction, Ryzen 10 cycles.
It’s similar for square root. Modern CPUs compute a non-approximated 32-bit sqrt(x) in 9-12 cycles (Ryzen being the fastest at 9-10, but Skylake ain’t bad either at 12 cycles), whereas older CPUs spent something like 20-40 cycles on it.
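For reference, here’s roughly what the trade-off looks like in code — a minimal sketch comparing the precise divps path against the rcpps approximation followed by one Newton-Raphson refinement step (function names are just for illustration):

```cpp
#include <immintrin.h>
#include <cstdio>

// Precise: one divps, correctly rounded per IEEE 754.
static __m128 reciprocal_exact(__m128 x) {
    return _mm_div_ps(_mm_set1_ps(1.0f), x);
}

// Approximate: rcpps (~12 bits of precision), refined with one
// Newton-Raphson step r' = r * (2 - x * r), giving ~22-23 bits.
static __m128 reciprocal_fast(__m128 x) {
    __m128 r = _mm_rcp_ps(x);
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(x, r)));
}

int main() {
    __m128 x = _mm_setr_ps(3.0f, 7.0f, 0.1f, 1234.5f);
    float a[4], b[4];
    _mm_storeu_ps(a, reciprocal_exact(x));
    _mm_storeu_ps(b, reciprocal_fast(x));
    for (int i = 0; i < 4; i++)
        printf("exact %.9g  approx %.9g\n", a[i], b[i]);
}
```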
> it's quite common that floating point ops throwaway bits (i.e. beyond the epsilon) will be numerically different between vendors.
I think these implementation-specific shenanigans only applied to legacy x87 instructions, which used 80-bit registers. SSE and AVX instructions operate on 32- or 64-bit floats. The math there, including rounding behavior, is well specified by the IEEE 754 standard.
It abuses AES instructions to generate a long pseudo-random sequence of bits, then re-interprets these bits as floats, does some FP math on these numbers, and saves the output.
I’ve tested addition, multiplication, FMA, and float-to-integer conversions. All 4 output files, 1 GB each, are bitwise equal between an AMD desktop and an Intel laptop.
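The idea is roughly this — a simplified sketch rather than the actual program (the key, counter scheme, sizes, and file name here are made up):

```cpp
// Compile with -maes on GCC/Clang.
#include <immintrin.h>
#include <cstdio>
#include <cstdint>

int main() {
    // Counter-mode-style generator: run an incrementing counter through
    // aesenc with a fixed key to get pseudo-random bits (not cryptographic,
    // just deterministic noise).
    const __m128i key = _mm_set_epi32(0x1234, 0x5678, 0x9abc, 0xdef0);
    FILE* f = fopen("fp_add.bin", "wb");
    if (!f) return 1;

    for (uint64_t i = 0; i < (1u << 20); i++) {
        __m128i ctr = _mm_set_epi64x(0, (long long)i);
        __m128i bitsA = _mm_aesenc_si128(ctr, key);
        __m128i bitsB = _mm_aesenc_si128(bitsA, key);

        // Re-interpret the random bits as 4 floats each and do FP math on them.
        __m128 a = _mm_castsi128_ps(bitsA);
        __m128 b = _mm_castsi128_ps(bitsB);
        __m128 sum = _mm_add_ps(a, b);   // swap for mul / FMA / conversions

        // Dump the raw results; the files can then be compared bit-for-bit
        // across machines, e.g. with cmp or a hash.
        float out[4];
        _mm_storeu_ps(out, sum);
        fwrite(out, sizeof(out), 1, f);
    }
    fclose(f);
}
```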
> It abuses AES instructions to generate a long pseudo-random sequence of bits, then re-interprets these bits as floats, does some FP math on these numbers, and saves the output.
Nice work! I'd be extremely curious to see if this still holds on intrinsics like `_mm512_exp2a23_round_ps` or `_mm512_rsqrt14_ps` (I'd wager it probably won't).
I don’t have AVX512-capable hardware in this house. Lately, I mostly develop for desktop and embedded. For both use cases, AVX512 is safe to ignore, as its market penetration is next to none.
> True, but it's quite common that floating point ops throwaway bits (i.e. beyond the epsilon) will be numerically different between vendors.
This is a common sentiment, and it is perhaps a helpful way to look at things if you don't want to dig into the details (even if it's not quite right).
But it's worth understanding the magnitude of the errors. rcpps is documented with a maximum relative error of 1.5 × 2^-12, while a correctly rounded single-precision operation is within 2^-24 of the exact result, so rcpps is 3-4 orders of magnitude "more wrong" than a typical operation (if you view the "epsilon" of most operations as the error after rounding). Said another way: it would take the cumulative error from many thousands of adds and multiplies to produce the same error as one rcpps operation.
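If you want to see that gap concretely, here's a quick sketch that sweeps some inputs and records the worst rcpps error relative to divps (the sweep range is arbitrary):

```cpp
#include <immintrin.h>
#include <cmath>
#include <cstdio>

int main() {
    // Sweep a range of inputs and record the worst relative error of rcpps
    // versus the correctly rounded reciprocal from divps.
    double worst = 0.0;
    for (int i = 1; i < 10000000; i++) {
        float x = 1.0f + i * 1e-6f;          // inputs in roughly [1, 11)
        __m128 v = _mm_set1_ps(x);
        float approx = _mm_cvtss_f32(_mm_rcp_ps(v));
        float exact  = _mm_cvtss_f32(_mm_div_ps(_mm_set1_ps(1.0f), v));
        double err = std::fabs((double)approx - (double)exact) / (double)exact;
        if (err > worst) worst = err;
    }
    // Intel documents the rcpps error bound as 1.5 * 2^-12 ~= 3.7e-4;
    // a correctly rounded float op is within 2^-24 ~= 6e-8 of the real value.
    printf("worst relative error of rcpps: %.3g (bound 1.5*2^-12 = %.3g)\n",
           worst, 1.5 * std::pow(2.0, -12));
}
```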
> Modern DirectXMath doesn't use these approximated instructions for inverting matrices; it's open source, see here: https://github.com/microsoft/DirectXMath/blob/83634c742a85d1....
Good find, neither does Wine, actually: https://doxygen.reactos.org/dc/dd8/d3dx9math_8h.html#a3870c6...
Would be interesting to see if this bug also happens on Linux.