Matrix inversion requires taking the reciprocal of a determinant. I don't have t...

Const-me · on July 19, 2020

Here’s Intel versus AMD relative error of RCPPS instruction: http://const.me/tmp/vrcpps-errors-chart.png AMD is Ryzen 5 3600, Intel is Core i3 6157U.

Over the complete range of floats, AMD is more precise on average, 0.000078 versus 0.000095 relative error. However, Intel has 0.000300 maximum relative error, AMD 0.000315.

Both are well within the spec. The documentation says “maximum relative error for this approximation is less than 1.5*2^-12”, in human language that would be 3.6621E-4.

Source code that compares them by creating 16GB binary files with the complete range of floats: https://gist.github.com/Const-me/a6d36f70a3a77de00c61cf4f6c1...

dvt · on July 19, 2020

This is great (I was actually just toying with the _mm_rcp_ss intrinsic)! It would also indicate that RCPSS is not the culprit.

Const-me · on July 19, 2020

> It would also indicate that RCPSS is not the culprit.

I’m not sure about that. I think it might be. Despite both are within specs, the results are numerically different.

Modern DirectXMath doesn‘t use these approximated instruction for inverting matrices, it’s open source, there: https://github.com/microsoft/DirectXMath/blob/83634c742a85d1...

dvt · on July 19, 2020

> Despite both are within specs, the results are numerically different.

True, but it's quite common that floating point ops throwaway bits (i.e. beyond the epsilon) will be numerically different between vendors.

> Modern DirectXMath doesn‘t use these approximated instruction for inverting matrices, it’s open source, there: https://github.com/microsoft/DirectXMath/blob/83634c742a85d1....

Good find, neither does Wine, actually: https://doxygen.reactos.org/dc/dd8/d3dx9math_8h.html#a3870c6...

Would be interesting to see if this bug also happens on Linux.

Const-me · on July 19, 2020

BTW, there’s much less reason to use these approximations on modern hardware.

Divps instruction (32-bit float divide, the precise version) was relatively slow back then, e.g. on AMD K8 it had 18-30 cycles both latency and throughput, 21 cycles on AMD Jaguar, Core 2 Duo did in 6-18 cycles.

Fortunately, they fixed the CPUs. Skylake has 11 cycles latency of that instruction, Ryzen 10 cycles.

It’s similar for square root. Modern CPUs compute non-approximated 32-bit sqrt(x) in 9-12 cycles (Ryzen being the fastest at 9-10 but Skylake ain’t bad either at 12 cycles), old CPUs were spending like 20-40 cycles on that.

Const-me · on July 20, 2020

> it's quite common that floating point ops throwaway bits (i.e. beyond the epsilon) will be numerically different between vendors.

I think these implementation-specific shenanigans only applied to legacy x87 instructions which used 80-bit registers. SSE and AVX instructions operate on 32 or 64-bit floats. The math there, including rounding behavior, is well specified in these IEEE standards.

I’ve made a small test app: https://github.com/Const-me/SimdPrecisionTest/blob/master/ma...

It abuses AES instructions to generate long pseudo-random sequence of bits, then re-interprets these bits as floats, does some FP math on these numbers, and saves the output.

I’ve tested addition, multiplication, FMA, and float to integer conversions. All 4 output files, 1 GB / each, are bitwise equal between AMD desktop and Intel laptop.

dvt · on July 20, 2020

> It abuses AES instructions to generate long pseudo-random sequence of bits, then re-interprets these bits as floats, does some FP math on these numbers, and saves the output.

Nice work! I'd be extremely curious to see if this still holds on intrinsics like `_mm512_exp2a23_round_ps` or `_mm512_rsqrt14_ps` (I'd wager it probably won't).

Const-me · on July 21, 2020

I don’t have AVX512 capable hardware in this house. Lately, I mostly develop for desktop and embedded. For both use cases AVX512 is safe to ignore, as the market penetration is next to none.

creato · on July 20, 2020

> True, but it's quite common that floating point ops throwaway bits (i.e. beyond the epsilon) will be numerically different between vendors.

This is a common sentiment, and it is perhaps a helpful way to look at things if you don't want to dig into the details (even if it's not quite right).

But it's worth understanding the magnitude of errors. rcpps will be 3-4 orders of magnitude "more wrong" compared to the typical operation (if you view the "epsilon" of most operations to be the error after rounding). Or said another way: it would take the cumulative error from many thousands of adds and multiplies to produce the same error as one rcpps operation.

acqq · on July 20, 2020

Related, LLVM developers evaluated the use of these instructions too:

https://bugs.llvm.org/show_bug.cgi?id=21385

On your graph, what’s red and what’s blue and what is on x axis?

Const-me · on July 20, 2020

AMD is red.

X axis is source value, I've extracted a small subset of the data to graph in Excel, there: https://gist.github.com/Const-me/a6d36f70a3a77de00c61cf4f6c1...

Also, I've just recalled Y is absolute error not relative; that's why it follows shape of 1/x.

acqq · on July 20, 2020

> I've extracted a small subset of the data to graph in Excel

Thanks. If you still have the data (or if you can regenerate it) it would actually be possible to make a few small graphs that would cover the whole set:

Basically, the idea is to have x points as much as there are different exponents, which in 32-bit floats is at most 256. Then on the y axis one draws the number of bits of maximum distance between the really correct value of the mantissa and the calculated mantissa in the whole interval of one exponent. Those would allow comparing the implementations of Intel and AMD separately, and these graphs are what I'd be very interested to see. So the idea is to find the maximums in an interval, and there are then a limited number of the intervals. Only if such graphs match between AMD and Intel it would be interesting to compare inside of the intervals, but differences on that level would be these that I'd expect would be the obvious ones making the most problems, where the results like in the article (full black instead of shadow) wouldn't be surprising.

For that don't have to juggle with huge files, it's only 256 values per each CPU in that pass that are to be compared.

Const-me · on July 21, 2020

Can re-generate the data files, it’s couple pages of code to copy-paste from my gist and compile, but I’m not sure how to implement what you wrote.

The results are not guaranteed to have same exponent, e.g. 1 / 0.499999 can be either 1.999999 or 2.0000001, both are correct within the precision, but have different exponents.

acqq · on July 21, 2020

The algorithm would be approximately:

1) Use the binary representation of the numbers! To do so, cast the resulting float to the unsigned integer, then use bit masks and shifts to extract the exponent and mantissa. Note that the leading 1 is not explicit but implicit in the IEEE format unless it's a denormal number (so make it always explicit during the extraction).

2) use the exponent of the correct result as the "interval" reference.

If the exponents are the same, do a subtraction of the smaller from the bigger mantissa, that's the absolute "distance" between the two numbers -- the goal is to find what is the biggest absolute distance in which interval.

3) If one of the 2 values that are compared has different exponent, they can be converted to the same by a bit shift. Shift the mantissa of the one with the bigger exponent left accordingly. Again do the subtraction and use the result as the absolute distance. The goal is to figure out the biggest absolute distance in each interval (maintaining a maximum for each interval).

In short, think binary, not decimal, and measure using these values. Binary are only values that matter, decimal representation doesn't necessarily represent the exact values of bits.

Examples:

float 1.0 == unsigned 0x3f800000 here exponent is 127 == 2^0 and mantissa 0 with implicit 1 at the start i.e. explicit: 0x800000

float 0.999999940395355224609375 = unsigned 0x3f7fffff here exponent is 126 == 2^-1 and mantissa explicit: 0xffffff

The absolute distance between these two numbers is 1 (adding one to the lowest bit of mantissa of the smaller number would result the higher number 0xffffff + 1 = 0x1000000, the later is the mantissa adjusted to the same exponent of the smaller number (0x800000 << 1) ). If the "correct" number was 0x3f800000 and even if the shift was needed to calculate the absolute distance, the interval is still 0 (i.e. 0 is the x axis value, as its exponent was 2^0 i.e. 127, and the value to be plotted is on y is 1 until a bigger distance occurs).

For more examples of the format you can play here:

https://www.h-schmidt.net/FloatConverter/IEEE754.html

Also note that a few exponents are special, meaning infinity or NaN. Whenever the "correct" answer is not a NaN or infinity but the "incorrect" is, that should be treated specially, if it actually happens.

Const-me · on July 21, 2020

Results from Zen 2: https://github.com/Const-me/SimdPrecisionTest/blob/master/vr... Results from Skylake: https://github.com/Const-me/SimdPrecisionTest/blob/master/vr...

Total, exact, less, greater columns have total count of floats in a bucket. Sum of the “Total” gives 2^32, the total count of unique floats.

Computing max bit that’s different is too slow for the use case, neither SSE nor AVX have vector version of BSR instruction. Instead, I’m re-interpreting floats as integers and computing difference of the integers. maxLess, maxGreater, and maxAbs columns have that maximum error, measured as count of float values of the error. The value 4989 means the mantissa had like 12-13 lowest bits incorrect.

Source code is there: https://github.com/Const-me/SimdPrecisionTest/blob/master/rc... Not particularly readable because I’ve used AVX2 and OpenMP, however this way it takes less than a second on desktop, and maybe 1.5 seconds on a laptop to process all of these floats.

acqq · on July 22, 2020

Based on how I understand the numbers in the tables, it looks to me like both implementations behave the same in the critical points, and AMD obviously achieves less distance from the "exact", but has some kind of truncating instead of rounding logic which changes the distribution of the approximations, and you discovered that with counting "less" and "greater." Congratulations!

acqq · on July 21, 2020

Really fascinating, thanks!

creato · on July 20, 2020

Funny, I think this graph is from a thread where I got roasted for arguing that differences in CPU implementations meant that Intel (or anyone else really) would need to be careful about shipping SIMD software on processors they don't test.

Dylan16807 · on July 20, 2020

Rightfully so. Because they were replacing the SIMD code with non-SIMD code that was similarly untested, with the side effect of crippling performance on their competitor's chips. That's a really bad kind of "being careful".

creato · on July 20, 2020

I'm not aware of instructions that have this implementation defined behavior risk in non-SIMD code, and you can be sure that Intel has tested this. If Microsoft had shipped a non-SIMD version of D3DXMatrixInverse, this bug likely could not have existed (or it would have been caught on the developer's machines).

In this case (games), they probably made the right tradeoff despite the bug. But in general? I don't want to rehash the argument, I'm just really, really glad I'm not the one flipping that switch (or subject to the hordes of HNers who think I'm a monster for not flipping it).

Const-me · on July 20, 2020

Correct, I've scavenged the comment, and the data+code, from an old thread here on HN.

nwallin · on July 19, 2020

In case anyone is wondering why RCP[PS]S and RSQRT[PS]S are specified to give only a relatively small number of digits of accuracy (I think only 7 or 14 bits?) it's because they use the fast inverse square root algorithm in hardware.

The IEEE754 floating point representation gives you an easy way to roughly approximately convert between the log2 of it. The exponent gives you exactly the integer log2 of the number, and the mantissa gives you a fractional linear term that you can drop onto the integer part to make it closer to the actual log2 version. This lets you do some fun things fairly easily.

x -> log2(x) -> -log2(x) -> 1/x

x -> log2(x) -> -0.5 log2(x) -> 1/sqrt(x)

x -> log2(x) -> 0.5 log2(x) -> sqrt(x)

The lossiness of the conversion limits your precision, but there's a lot of times you don't give a shit. So the instructions are still valuable even if they're only approximately correct. For the reciprocal case, it will do something like this:

https://godbolt.org/z/n6M4az

I'm not sure where the nondeterminism comes in between AMD and Intel. As you can see, with the reciprocal, there's no magic constant like there is with inverse square root. Maybe they're fudging something, or have a different form of Newton's method, I dunno.

cwzwarich · on July 19, 2020

Approximations implemented in HW generally use different techniques than ones in SW. They typically start with a lookup table, possibly followed by some other steps. Someone reverse-engineered the old AMD 3DNow implementations here:

https://pdf.sciencedirectassets.com/272990/1-s2.0-S157106610...

nwallin · on July 20, 2020

It is utterly bizarre to me that a table lookup is preferable to integer subtraction. The f32 registers are already hooked up to i32 ALUs. A table lookup requires die space, while mine requires none, and the i32 subtraction is already heavily optimized.

An initial thought I had that was that their table lookup is to find a magic constant that might nudge the result of the final value in the right direction. (instead of the magic constant 0x7F000000 that my code boils down to.) But that doesn't seem to be what my Kaby Lake is doing.

This is gonna bug me all night, I know it.

Const-me · on July 20, 2020

> The f32 registers are already hooked up to i32 ALUs.

Technically they are. Practically many CPUs, especially older one, have non-trivial latency cost for passing a value between FP and integer ALUs, like couple of cycles each direction.

Ever wondered why there're 3 sets of bitwise instructions, e.g. pandn, andnps, andnpd which do exactly same thing with the bits in these registers? Now you know.

creato · on July 20, 2020

They're basically the same thing, just the integer aliasing approach is a lookup table with one element in it, while an actual lookup table will have more, giving better precision for roughly the same cost (or even better, since you don't need to load the constant into a register).

withinrafael · on July 19, 2020

Yep, RCPSS is present in d3dx9_31!intelsse2_D3DXMatrixInverse.

ajross · on July 19, 2020

That seems likely. Per the SDM the required precision is to within 1.5*2^-12, which is quite low and allows a lot of slop in the implementation.

One possibility is that there's an accidental equality test on the path of a runtime (AMD) value vs. a value that is computed at compile time (presumably on an Intel CPU).

graycat · on July 19, 2020

deleted

wtallis · on July 19, 2020

I think you've lost sight of the fact that this entire discussion is constrained to the 4x4 matrices used in computer graphics. How to best invert a matrix of a particular small fixed size is a very different question from how to best invert matrices in general.

graycat · on July 19, 2020

deleted

TJSomething · on July 19, 2020

While true, there is a reason that the GLSL standard includes builtin data types for matnxm for all n,m in {2, 3, 4}. The single most common matrix operations in computer graphics are related to linear transforms on color spaces and geometry. Both of those max out at transforms on 4D vectors, for RGBA and 3D projective geometry.