It may be useful to mention that modern architectures often have vectorized instructions like vrsqrt14ps (accessible via the _mm512_rsqrt14_ps intrinsic) that provide a 14-bit approximation (there are more accurate variants too) in every lane, with an inverse throughput of 2 cycles. These are faster than the integer bit hacks.
AMD Vega has V_RSQ_F32 ("Reciprocal Square Root"), NVIDIA has rsqrt.approx.f32, and ARM NEON has vrsqrteq_f32.
So all the major platforms, CPUs and GPUs, implement the fast reciprocal square root to a decent degree of accuracy, without any need for bit-twiddling anymore.
You could create an instruction that performs X Newton iterations in 1 cycle, picking X to give you 14-bit precision.
With such an instruction, you could roughly double the number of correct bits in 2 cycles by just using the same instruction twice.
You can't do that on Intel's hardware. As you mention, you'd need to roll your own Newton iteration out of multiple SIMD instructions and run it after the initial estimate.
That's really sad. Intel CPUs do have hardware that performs Newton iterations — that's how the instruction is implemented internally — but the ISA only exposes that hardware as "do a 14-bit rsqrt operation", which means you can't really use it to increase precision when 14 bits doesn't suffice for your app.
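For what it's worth, here's a sketch of the hand-rolled refinement on x86 (SSE shown, since its ~12-bit rsqrtss is universally available; the AVX-512 version is analogous — the function name is mine, not a standard API):

```c
#include <xmmintrin.h>

/* Hardware rsqrt estimate refined by one Newton-Raphson step:
   y' = y * (1.5 - 0.5 * x * y * y). This is the multi-instruction
   sequence you have to write yourself, since the ISA won't iterate
   for you. One step takes the ~12-bit estimate to roughly 23 bits. */
static float rsqrt_refined(float x) {
    __m128 vx = _mm_set_ss(x);
    __m128 y  = _mm_rsqrt_ss(vx);             /* hardware estimate */
    __m128 hx = _mm_mul_ss(_mm_set_ss(0.5f), vx);
    __m128 y2 = _mm_mul_ss(y, y);
    __m128 t  = _mm_sub_ss(_mm_set_ss(1.5f), _mm_mul_ss(hx, y2));
    return _mm_cvtss_f32(_mm_mul_ss(y, t));   /* refined result */
}
```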
I totally agree! It's easy to read and has a bit of a "paper" feel to it, but at the same time the sans-serif font makes it still feel "modern" and like a website.
Thanks! I have a completely rewritten version in the pipeline, where I use easier versions of the theorems/proofs and fix some mistakes (in fact, the proof of lemma 1 is quite nonsensical if you look closely).
> Games calculate square roots and inverse square roots all the time to find the lengths of vectors
A trick here is that one often doesn't have to do the square root. For instance, if you want something to happen when an object is 5 units away from another object, it's normal to do
if sqrt( (x2-x1)^2 + (y2-y1)^2 ) < 5 then ...
but instead you can do
if (x2-x1)^2 + (y2-y1)^2 < 5^2 then ...
trading a sqrt for a squaring. And often the square of the distance can be cached or is even a constant. So a lot of libraries (for instance the vector library from libgdx[0]) contain a dst function, but also a dst2 function that skips the square root to save some cycles when it's not needed.
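A minimal sketch of the trick in C (`within_range` is an illustrative name, not libgdx's API):

```c
#include <stdbool.h>

/* Compare squared distance against squared radius: the sqrt disappears.
   Works because both sides are non-negative and squaring is monotonic. */
static bool within_range(float x1, float y1, float x2, float y2, float r) {
    float dx = x2 - x1;
    float dy = y2 - y1;
    return dx * dx + dy * dy < r * r;   /* no sqrt call */
}
```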
I vaguely recall seeing in Graphics Gems an additional coefficient placed there that's been computed to minimize the error of this metric vs. Euclidean one.
Thanks for this. I have one nit. You show `q = 1598029824 - u/2;` as being identical to `q = 0x5F400000 - u >> 1;`, but in C (and most languages) subtraction binds tighter than the shift, so the second expression parses as `(0x5F400000 - u) >> 1`, giving different results. It would be clearest to add parentheses in the second case.
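To make the nit concrete (helper names are mine):

```c
#include <stdint.h>

/* In C, binary `-` has higher precedence than `>>`, so the
   unparenthesized `0x5F400000 - u >> 1` is the second parse,
   not the intended first one. */
static uint32_t intended(uint32_t u)  { return 0x5F400000 - (u >> 1); }
static uint32_t as_parsed(uint32_t u) { return (0x5F400000 - u) >> 1; }
```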
You mention the wiki page is badly written, and I have the same feeling whenever I read about anything mathematics related on it. But... it's written by volunteers, so I'll take what I can get. Since you obviously have the talent for explaining things, do you think the wiki page could be edited and improved for clarity?
I have thought about it, but I always worry about making large edits to Wikipedia pages that I'll put a lot of work into it and then someone will come along and just revert it. Maybe that fear is unfounded - I'll put it on my (very long) todo list!
This is a great post! The setup could be clearer: I found it a little difficult to follow (during “a real number x … this value which I will denote f … treat the 31-bit value as an unsigned integer u”) what each of the named variables was intended to refer to, and formatting the 31-bit number as 32 bits threw me off too. Nothing insurmountable but I could feel my brain stumbling before I got to the clever part.
...which (naturally) can't determine whether some number x is equal to y; that alone makes constructive real numbers challenging to use, and their usage is virtually non-existent. I believe the best-known software using them is the Android calculator since 6.0, and Hans Boehm (of GC fame) documented various issues encountered [1].
> [Fixed point arithmetic] has some niche uses but isn't commonly used
It is very common in the embedded world and in hardware. I routinely program a CPU which has no FPU, so any arithmetic will have to be done in fixed point.
It's perfectly possible to get work done with fixed point arithmetic, it just requires a bit more thinking through so you don't run out of bits. Usually, I write unit tests where I compile my code for the host architecture and compare the fixed point result to the host's floating point. Then I can set a precise error margin for my fixed-point approximation.
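For readers unfamiliar with the technique, a minimal Q16.16 sketch (names are illustrative):

```c
#include <stdint.h>

/* Q16.16 fixed point: 16 integer bits, 16 fractional bits.
   Multiplication widens to 64 bits so the intermediate product
   doesn't overflow before the corrective shift. */
typedef int32_t q16_16;
#define Q16_ONE (1 << 16)

static q16_16 q16_from_int(int32_t x) { return x * Q16_ONE; }
static q16_16 q16_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * (int64_t)b) >> 16);
}
static float q16_to_float(q16_16 x) { return (float)x / (float)Q16_ONE; }
```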
Another 'niche' use case is doing DSP calculations on an FPGA. If you can accept the trade-off of some degree of error with fixed-point, you can have very fast and heavily parallelized implementations.
Here, "inverse" is an abbreviation of "multiplicative inverse" (aka. the reciprocal). Granted, shortening it to just "inverse" is misleading, but the name has stuck.
The original article has the same error in the title. x^(-1/2) is the reciprocal square root. The fault lies with math notation: raising x to a negative power, i.e. x^(-a), means 1/(x^a), while the inverse of a function F is written F^(-1)(x). The inverse of the square root (as a function) would just be squaring. Edit: I've seen it called the multiplicative inverse of the square root.
In C++, type punning through pointers and unions is undefined behavior. Even `reinterpret_cast<T>` isn’t allowed because of aliasing (IIRC). The only “defined” way to do type punning is a memcpy. A compiler targeting something like x86 would optimize out the memcpy.
For more information, see the C++20 final draft[0], §7.6.1.9.
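For reference, the memcpy idiom looks like this (shown in C; the C++ version is identical apart from headers, and C++20's `std::bit_cast` does the same job):

```c
#include <stdint.h>
#include <string.h>

/* The well-defined way to read a float's bit pattern: copy the bytes.
   Optimizing compilers lower this to a single register move, not a
   function call. */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```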
Casting pointers in C used to work back in the day. It still works in most cases (except with aggressive optimization options), but it may make some people uneasy.
I always thought the memcpy is optimized out, but can you explain how in

    int i;
    float f;
    memcpy(&i, &f, 4);

the memcpy can be optimized out? Probably I am misunderstanding the statement.
The C abstract machine presents i and f as having separate locations in memory, but a compiler that knows what memcpy does can avoid actually storing the values to the stack on the target machine, and can even avoid copying between registers.
The compiler is not beholden to the standard library but to the standard. All modern compilers come with built-in knowledge of how standard library functions like memcpy are supposed to behave, and as long as the visible effect is the same, the compiler is allowed to do anything it wants. So instead of calling memcpy for a size of 4, it can e.g. just use a mov instruction to move the value from a float register to an integer register.[1] With additional analysis of how the values are used, it may not even need to move the value at all, but that's the gist of it.
A good example of this is the string functions in the C stdlib. I’ve reverse engineered some programs where the compiler used x86’s built-in “string” instructions instead of calling out to, say, `strlen`.
> you also have to assume the size of those primitive types
This usually works, though there are some situations where you could still be surprised. For instance, if you're programming embedded processors with avr-gcc, you'll have sizeof(double)==4.
Yes, that was the only "portable" solution. I use quotes because we had to assert() that sizeof(float) matched uint32_t in the code prologue (it is unlikely to be an issue now, but the code is 25 years old, so we stuck to the spirit of expecting compilers where C types could be a variety of sizes, not just the "conventional" sizes of today). Unions are defined; casting is very poorly defined, especially float<->int. There's a good link a few comments down.
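In modern C the prologue check can even be moved to compile time; a sketch of both styles (the message text is mine):

```c
#include <assert.h>
#include <stdint.h>

/* C11 can reject the build outright if the size assumption fails... */
_Static_assert(sizeof(float) == sizeof(uint32_t),
               "float must be 32 bits for the bit-pattern tricks");

/* ...while 25-year-old C had to check at startup instead. */
static void prologue_checks(void) {
    assert(sizeof(float) == sizeof(uint32_t));
}
```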
> But floating point numbers are always greater than LNS numbers
Unless I'm missing something, they are never smaller, but they can be equal (at powers of two).
> E.g. x^{1/2}, x^{-8}, x^2, though you probably wouldn't use it for positive exponents since you can just use multiplication to get an exact answer.
It is fun to read the source of old games for gems like the discussed `Q_rsqrt` function. However, one often wonders what the limits/assumptions/guarantees of such tricks are so I've blogged about SMT-based reasoning about such properties using `Q_rsqrt` as an example: https://bohlender.pro/blog/smt-based-optimisation-of-fast-in... -- might be interesting to those not too familiar with the verification business.
While this may look disappointing, it should be clear that adding 5% (or even 0.1%) error to inverse square root calculations is not something compilers are in the business of doing.
Now, when you give the compiler more leeway (and -ffast-math is substantial leeway for anything less ephemeral than triangle normals on screen), you get much more interesting things:
Turns out x86 has a (microcoded) instruction for this (and newer processors also have vectorized versions[1] of it). The compiler adds Newton-Raphson for good measure.
x87 fsqrt took ~70 cycles on the Pentium MMX, and x87 fdiv was ~39 cycles - so doing an inverse square root on x87 was ~109 cycles. The fast inverse square root routine was in the vicinity of 10 cycles.
In Quake, each vertex needed to have its normal vector calculated, which resulted in... I dunno, 100,000 inverse square root calculations per second, assuming you're pushing 100,000 polygons per second. Assuming a 100MHz CPU, the x87 inverse square root calculations would consume 11% of the CPU cycles, the fast inverse square root routine would consume 1% of the CPU cycles.
I don't know how heavily it was used in Quake but I'd bet quite a lot. Carmack didn't mess around optimizing things for no reason, he was (and is) pretty focused on spending time (coder and CPU) where it will do most good.
Also, a "mere" 1% speedup seems trivial for most general coding but squeezing 1% out of optimized game code is like blood from a stone. Add a bunch of similar tricks together and you can see 10-20% overall improvement which is huge.
I mean, it's now ingrained in programming lore, but the reality is that Carmack has nothing to do with this, and Carmack himself has never taken credit for it. The fast inverse square root dates back to the 80s and has its roots in SGI's systems. One of the developers who worked at SGI would eventually go on to work at id Software developing Quake, and worked on this optimization.
Oh, I didn't mean to imply Carmack invented it - he himself is pretty clear that he didn't. I just meant that incremental improvements in efficiency like this can add up surprisingly fast, and that once the low hanging fruit are taken, they're well worth doing.
I've always wondered too if it would be possible to log all the values it's actually calculating the inverse square root for, and just save them in a lookup table.
Or even just the most common x% of them, if you could reliably know which numbers it most commonly runs the calculation on.
It clearly depends on the workload. If you have tons of 3D particles or polygons on screen, one of the most repeated operations is computing norms: x / sqrt(sum(x_i^2)). The fast inverse square root was used in Quake explicitly for this speed advantage.
It was first found in Quake 3, which I believe used software rendering. Given that the inverse square root is heavily used for lighting functions, and it was all done on the CPU, it's reasonable to believe that this provided a non-negligible speedup overall.
iirc Quake 3 required a 3D accelerator and didn't support software rendering - but this was before graphics cards supported hardware T&L so you're right those lighting calcs would be performed on the CPU
Not having heard of something is always understandable, and it's pretty immature to make fun or "feign surprise" when someone doesn't know something you know.
But not searching to see if something had been posted before on a forum is a bit different[0]
XKCD is legit shit. This one, the standards one, the Bobby Tables one. And the password one. Things that are barely worth a chuckle are repeated as gospel.
xkcd is as generic as they get. Randall goes idea -> generate numbers -> generate graph -> punchline, and people milk it for as long as possible. It's like geek-themed Garfield. I wouldn't be surprised if he tries xkcd without stick people soon.
https://software.intel.com/sites/landingpage/IntrinsicsGuide...