Fast Inverse Square Root (timmmm.github.io)
294 points by timhh on Nov 2, 2020 | 96 comments


It may be useful to mention that modern architectures often have vectorized instructions like vrsqrt14ps (accessible via the _mm512_rsqrt14_ps intrinsic) that provide a 14-bit approximation (there are more accurate variants too) in every lane, with an inverse throughput of 2. These are faster than the integer bit hacks.

https://software.intel.com/sites/landingpage/IntrinsicsGuide...
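
For anyone curious, a minimal sketch of using that intrinsic (assuming a CPU and compiler with AVX-512F support; the function name is made up):

    #include <immintrin.h>

    // Approximate 1/sqrt(x) for 16 floats at once via vrsqrt14ps.
    // Maximum relative error is below 2^-14; refine with Newton-Raphson
    // if more accuracy is needed.
    __m512 rsqrt14_approx(__m512 x) {
        return _mm512_rsqrt14_ps(x);
    }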


AMD Vega has "V_RSQ_F32" ("Reciprocal Square Root"), NVidia has rsqrt.approx.f32, and ARM has vrecpsq_f32.

So all the major platforms, CPUs and GPUs, implement the fast reciprocal square root to decent amounts of accuracy, without any need of bit-twiddling anymore.


Though it's worth noting that's only available on specific AVX-512 CPUs.

Though you can still get vectorised support at regular precision with AVX2.


_mm256_rsqrt_ps gives 12-bit accuracy in AVX.

https://software.intel.com/sites/landingpage/IntrinsicsGuide...


I imagine the bit hacks are still useful for microcontroller applications though.


I prefer architectures that have a vector instruction for computing one or two Newton iterations instead.

That way you can quickly get the precision that you want ;)


vrsqrtps has once-per-cycle throughput on Skylake / Ice Lake.

You can always run Newton's method afterwards to improve the accuracy. But getting the maximum accuracy from one cycle is probably best.
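
For example, a rough sketch (not tuned) of refining the SSE estimate with one Newton-Raphson step, y' = y * (1.5 - 0.5 * x * y * y):

    #include <xmmintrin.h>

    // rsqrtps gives ~12 bits of precision; one Newton-Raphson step
    // roughly doubles the number of correct bits.
    __m128 rsqrt_refined(__m128 x) {
        __m128 y      = _mm_rsqrt_ps(x);                  // initial estimate
        __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
        __m128 y2     = _mm_mul_ps(y, y);
        return _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f),
                                        _mm_mul_ps(half_x, y2)));
    }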


You can create an instruction to perform X Newton iterations in 1 cycle if you want, and you can pick X to give you 14-bit precision.

With such an instruction, you could exponentially increase the precision in 2 cycles by just using the same instruction twice.

You can't do that on Intel's hardware. As you mention, you'd need to roll your own Newton iteration out of multiple SIMD instructions and use it after the first rsqrt call.

That's really sad. There is hardware on Intel CPUs to perform Newton iterations (that's how the instruction is implemented), but the ISA only exposes it as a "do a 14-bit rsqrt" operation, which means you can't really use it to increase precision if that doesn't suffice for your app.


A little self-promotion: I've written a blog post to answer the question that this one ends with (how to optimize divisions by constant integers): https://rubenvannieuwpoort.nl/posts/division-by-constant-uns...

https://ridiculousfish.com/blog/posts/labor-of-division-epis... and https://ridiculousfish.com/blog/posts/labor-of-division-epis... also explain this (and are probably better written than my blogpost).
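
As a one-function taste of what those posts cover (a sketch; 0xAAAAAAAB is the usual magic constant for unsigned division by 3):

    #include <stdint.h>

    // Unsigned division by the constant 3 without a divide instruction:
    // multiply by ceil(2^33 / 3) = 0xAAAAAAAB and keep the top bits.
    uint32_t div3(uint32_t n) {
        return (uint32_t)(((uint64_t)n * 0xAAAAAAABu) >> 33);
    }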


Off-topic comment: I really like the clutter-free look of your site, with the LaTeX-to-HTML generator.

For anyone curious like I was:

https://github.com/rubenvannieuwpoort/static-site-generator


I totally agree! It's easy to read and has a bit of a "paper" feel to it, but at the same time the sans-serif font makes it still feel "modern" and like a website.


Very nice page!

Minor nit, there's a typo here in the power-of-two example:

    uint divide(uint n) {
        return n << p;
    }
the shifts should be to the right, obviously.


Thanks! I have a completely rewritten version in the pipeline, where I use easier versions of the theorems/proofs and fix some mistakes (in fact, the proof of lemma 1 is quite nonsensical if you look closely).


> Games calculate square roots and inverse square roots all the time to find the lengths of vectors

A trick here is that one often doesn't have to do the square root. For instance, if you want something to happen when an object is 5 units away from another object, it's normal to do

    if sqrt( (x2-x1)^2 + (y2-y1)^2 ) < 5 then ...
but instead you can do

    if (x2-x1)^2 + (y2-y1)^2 < 5^2 then ...
trading a sqrt for a squaring. And often the square of the distance can be cached or is even a constant. So a lot of libraries (for instance the vector library from libgdx[0]) contain a dst function, but also a dst2 function that skips the square root to save some cycles when it's not needed.

[0]: https://libgdx.badlogicgames.com/ci/nightlies/docs/api/com/b...
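
In code, the squared-distance check looks something like this (just a sketch, names made up):

    // Compare squared distance against a squared radius: same result as
    // comparing the distances themselves, but with no sqrt.
    bool within_radius(float x1, float y1, float x2, float y2, float r) {
        float dx = x2 - x1;
        float dy = y2 - y1;
        return dx * dx + dy * dy < r * r;  // r * r is often a cached constant
    }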


Sometimes you can even get away with dropping the squares completely:

  if abs(x2-x1) + abs(y2-y1) < abs(x4-x3) + abs(y4-y3) then ...
I once used this in a path tracer to speed things up a little. The results were less accurate, but sometimes that's an acceptable trade-off.


https://gist.github.com/izackp/5ae96a740678946d8763c198b0e54... I have a list of functions for distance calculation, with metrics on speed and accuracy, in case anyone finds it useful.


I vaguely recall seeing in Graphics Gems an additional coefficient placed there that's been computed to minimize the error of this metric vs. the Euclidean one.
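
That sounds like the "alpha max plus beta min" trick. A sketch with one commonly quoted coefficient pair (the exact values depend on which error you minimize, so treat these as illustrative):

    #include <algorithm>
    #include <cmath>

    // Approximate sqrt(x*x + y*y) with no sqrt at all. With these
    // coefficients the maximum relative error is around 4%.
    float approx_length(float x, float y) {
        float ax = std::fabs(x), ay = std::fabs(y);
        return 0.960f * std::max(ax, ay) + 0.398f * std::min(ax, ay);
    }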


This is an explanation of the fast inverse square root that I hope is easy to understand. I also managed to improve on it a little bit.


Thanks for this. I have one nit. You show `q = 1598029824 - u/2;` as being identical to `q = 0x5F400000 - u >> 1;`, but every language I know of uses a different order of operations, giving different results. It might be clearest to provide parentheses in the second case.
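
Concretely (a sketch; the variable is just for illustration):

    #include <stdint.h>

    // In C, C++, Java, etc., '-' binds tighter than '>>', so these differ:
    uint32_t parse_demo(uint32_t u) {
        uint32_t a = 0x5F400000 - u >> 1;    // parsed as (0x5F400000 - u) >> 1
        uint32_t b = 0x5F400000 - (u >> 1);  // what "1598029824 - u/2" means
        return a ^ b;                        // generally not equal
    }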


Interestingly enough, it seems to do the shift first in Swift.


All well-known programming languages will do the shift first.


I looked at about 20 different "popular" languages and I believe that only 2 (Go and Swift) of them have shifts at a higher precedence than addition.

I think I'm missing a reference/joke =/


Or this might just be another confident but baseless HN commenter assertion.


C, C++, Java, and Rust all have shift as lower precedence than addition/subtraction. Edit: and Javascript, Python.


You can add C# as well.


Use brackets to be explicit about your expected order of operations.


Oops! I added brackets, thanks!


You mention the wiki page is badly written, and I have the same feeling whenever I read about anything mathematics related on it. But... it's written by volunteers, so I'll take what I can get. Since you obviously have the talent for explaining things, do you think the wiki page could be edited and improved for clarity?


I have thought about it, but whenever I consider making large edits to Wikipedia pages I worry that I'll put a lot of work in and then someone will come along and just revert it. Maybe that fear is unfounded - I'll put it on my (very long) todo list!


Very nice piece! Also, minuend and subtrahend are the standard terms.


This is a great post! The setup could be clearer: I found it a little difficult to follow (during “a real number x … this value which I will denote f … treat the 31-bit value as an unsigned integer u”) what each of the named variables was intended to refer to, and formatting the 31-bit number as 32 bits threw me off too. Nothing insurmountable but I could feel my brain stumbling before I got to the clever part.


Computers do not have _real_ number systems, only _rational_ number systems and rational approximations of real numbers.


There are plenty of libraries for computable numbers [1], which sit between the rationals and reals

[1] https://en.wikipedia.org/wiki/Computable_number


...which (naturally) can't determine whether some number x is equal to y or not; that alone makes using constructive real numbers challenging, and their usage is virtually non-existent. I believe the best-known software using them is the Android calculator since 6.0; Hans Boehm (of GC fame) documented the various issues encountered [1].

[1] https://dl.acm.org/doi/fullHtml/10.1145/2911981


Nice, I was looking at this, but for sqrt (basically the same integer trick).


> [Fixed point arithmetic] has some niche uses but isn't commonly used

It is very common in the embedded world and in hardware. I routinely program a CPU which has no FPU, so any arithmetic will have to be done in fixed point.

It's perfectly possible to get work done with fixed-point arithmetic; it just requires a bit more thinking through so you don't run out of bits. Usually, I write unit tests where I compile my code for the host architecture and compare the fixed-point result to the host's floating point. Then I can set a precise error margin for my fixed-point approximation.
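
For anyone who hasn't used it: fixed point is just integers with an implicit scale factor. A minimal Q16.16 sketch (function names made up):

    #include <stdint.h>

    // Q16.16 fixed point: the stored integer is the value times 65536.
    typedef int32_t q16_16;

    static inline q16_16 q_from_double(double x) {
        return (q16_16)(x * 65536.0);
    }

    // Multiplication needs a wider intermediate so the product doesn't
    // overflow before the extra scale factor is shifted back out.
    static inline q16_16 q_mul(q16_16 a, q16_16 b) {
        return (q16_16)(((int64_t)a * b) >> 16);
    }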


Another 'niche' use case is doing DSP calculations on an FPGA. If you can accept the trade-off of some degree of error with fixed-point, you can have very fast and heavily parallelized implementations.


To me "inverse square root" means "square", which makes the title of this kind of funny.


Yeah, shouldn't it be 'fast reciprocal square root'?


Agreed. Inverse is not a synonym for reciprocal, but it's too late, the name has stuck.


Here, "inverse" is an abbreviation of "multiplicative inverse" (aka. the reciprocal). Granted, shortening it to just "inverse" is misleading, but the name has stuck.


No, it's completely different from simple squaring. The inverse square root is undefined for negative numbers ;)


That's not right. The so-called fast inverse square root function computes the reciprocal of the square root, not the square.

You're not the only one to be confused by the name. An awful choice of name, in my opinion.


Here's another article (from 2012) describing the same calculation:

http://h14s.p5r.org/2012/09/0x5f3759df.html


The original article has the same error in the title. x^(-1/2) is the reciprocal square root. The fault is with math notation, where raising x to a negative power, i.e. x^-a, means 1/(x^a), while the inverse of a function F(x) is written F^-1(x). The inverse of the square root would be just the square. Edit: I've seen it called the multiplicative inverse of the square root.


FYI: Type punning like this doesn't work on all compilers

i = * ( long * ) &y; // evil floating point bit level hacking

Learned this recently with Arm AC6.

[Also, this kind of genius analytic approximation of hot functions that do math makes me all tingly]


Would a union do the trick?


That's the Linux way to do it

    union { int i; float f; } u = { .f = 1.23f };
    int i = u.i;
Another way is a memcpy, which I believe is the most defined way to do type punning

    int i;
    float f = 1.23f;
    memcpy(&i, &f, 4);
You also have to assume the size of those primitive types, but that's pretty safe in modern C/C++.


In C++, type punning through pointers and unions is undefined behavior. Even `reinterpret_cast<T>` isn’t allowed because of aliasing (IIRC). The only “defined” way to do type punning is a memcpy. A compiler targeting something like x86 would optimize out the memcpy.

For more information, see §7.6.1.9 of the C++20 final draft[0].

[0]: https://isocpp.org/files/papers/N4860.pdf


I really wish C and C++ had an explicit type punning operator. They are systems programming languages and this is very often needed.


C++20 introduces bit_cast to do this. The implementation of bit_cast is semantically equivalent to using memcpy.
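
For reference, a minimal sketch of what that looks like (C++20, assuming a 32-bit float):

    #include <bit>
    #include <cstdint>

    // Well-defined bit-for-bit reinterpretation; no UB, no aliasing
    // concerns, and it compiles to the same code as the memcpy idiom.
    std::uint32_t float_bits(float f) {
        return std::bit_cast<std::uint32_t>(f);
    }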


Casting pointers in C used to work back in the day. It still works in most cases (except with aggressive optimization options), but it may make some people uneasy.


There's a Torvalds rant about this specific UB and it's how they do it in the Linux kernel.


Linux is written in C. In C99 and later, type punning through unions is a well-defined operation. No issues there.

It’s C++ that’s lacking this feature. Not C.


I always thought the memcpy is optimized out, but can you explain how, in int i; float f; memcpy(&i, &f, 4), the memcpy can be optimized out? Probably I am misunderstanding the statement.


The C abstract machine presents i and f as having separate locations in memory, but a compiler that knows what memcpy does can avoid actually spilling the values to the stack, and can even avoid copying them between registers.


Thanks. Does that really happen?


Yes.


The compiler is not beholden to the standard library implementation, only to the standard. So all modern compilers come with built-in knowledge of how standard library functions like memcpy are supposed to behave, and as long as the visible effect is the same, they're allowed to do whatever they want. So instead of calling the memcpy function for a size of 4, the compiler can e.g. just use a mov instruction to move the value from a float register to an integer register. [1] It does additional analysis on how the values are used, and maybe it doesn't even need to move the value at all, but that's the gist of it.

[1] https://gcc.godbolt.org/z/x4P7jE


A good example of this is the string functions in the C stdlib. I’ve reverse engineered some programs where the compiler used x86’s built-in “string” instructions instead of calling out to, say, `strlen`.


Punning through a union is explicitly allowed in the GCC documentation, which explains why it's reliable for use in the Linux kernel. See

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Typ...

> you also have to assume the size of those primitive types

This usually works, though there are some situations where you could still be surprised. For instance, if you're programming embedded processors with avr-gcc you'll have sizeof(double)==4.


Do modern compilers optimize away the memcpy?


yes


Yes, that was the only "portable solution". I use quotes because we had to assert() that sizeof(float) matches uint32_t in the code prologue (it is unlikely to be an issue now, but the code is 25 years old, so we stuck to the spirit of expecting compilers where C types could be a variety of sizes, not just the "conventional" sizes of today). Unions are defined; casting is very poorly defined, especially float<->int. There's a good link a few comments down.


Small nitpicks:

> But floating point numbers are always greater than LNS numbers

Unless I'm missing something, they are never smaller, but they can be equal (at powers of two).

> E.g. x^{1/2}, x^{-8}, x^2, though you probably wouldn't use it for positive exponents since you can just use multiplication to get an exact answer.

1/2 is positive. As is 1/3.


> they can be equal

Ha, funnily enough I did think of that, but I was trying to keep the explanation short and simple (and a bit hand-wavy).

> 1/2 is positive. As is 1/3

Oops, fixed, thanks!


The implementation can be more concisely written (and without reassigning variables) as

  float InvSqrt(float x)
  {
     long yl;
     float y;

     yl = 0x5f3759df - ((*(long *) &x) >> 1);
     y = *(float *) &yl;
     return y * (1.5F - (x * 0.5F * y * y));
  }


It is fun to read the source of old games for gems like the discussed `Q_rsqrt` function. However, one often wonders what the limits/assumptions/guarantees of such tricks are, so I've blogged about SMT-based reasoning about such properties using `Q_rsqrt` as an example: https://bohlender.pro/blog/smt-based-optimisation-of-fast-in... -- might be interesting to those not too familiar with the verification business.


Copy of the original article that started all this

https://www.beyond3d.com/content/articles/8/


I wonder how it holds up to something like this on a modern processor and compiler:

    float rsqrt(float number) { return 1.0f / sqrtf(number); }


Wonder no longer!

https://godbolt.org/z/M4TKb7

While this may look disappointing, it should be clear that adding 5% (or even 0.1%) error to inverse square root calculations is not something compilers are in the business of doing.

Now, when you give the compiler more leeway (and -ffast-math is substantial leeway for anything less ephemeral than triangle normals on screen), you get much more interesting things:

https://godbolt.org/z/85zG5r

Turns out x86 has a (microcoded) instruction for this (and newer processors also have vectorized versions[1] of it). The compiler adds a Newton-Raphson step for good measure.

[1]: https://uops.info/html-lat/KBL/VRSQRTSS_XMM_XMM_XMM-Measurem...


What was the speedup from this?

If only 1% of cpu time was spent on slow square root, and this sped it up by 100%, it would be barely worth it.


> and this sped it up by 100%

Over 1,000%.

x87 fsqrt took ~70 cycles on the Pentium MMX, and x87 fdiv was ~39 cycles, so doing an inverse square root on x87 was ~109 cycles. The fast inverse square root routine was in the vicinity of 10 cycles.

In Quake, each vertex needed to have its normal vector calculated, which resulted in... I dunno, 100,000 inverse square root calculations per second, assuming you're pushing 100,000 polygons per second. Assuming a 100MHz CPU, the x87 inverse square root calculations would consume 11% of the CPU cycles, the fast inverse square root routine would consume 1% of the CPU cycles.

https://en.wikipedia.org/wiki/X87#Performance


I don't know how heavily it was used in Quake but I'd bet quite a lot. Carmack didn't mess around optimizing things for no reason, he was (and is) pretty focused on spending time (coder and CPU) where it will do most good.

Also, a "mere" 1% speedup seems trivial for most general coding, but squeezing 1% out of optimized game code is like getting blood from a stone. Add a bunch of similar tricks together and you can see a 10-20% overall improvement, which is huge.


I mean, it's now ingrained in programming lore, but the reality is that Carmack has nothing to do with this, and Carmack himself has never taken credit for it. The fast inverse square root dates back to the 80s and has its roots in SGI's systems. One of the developers who worked at SGI would eventually go on to work at id on Quake, and worked on this optimization.


Oh, I didn't mean to imply Carmack invented it - he himself is pretty clear that he didn't. I just meant that incremental improvements in efficiency like this can add up surprisingly fast, and that once the low hanging fruit are taken, they're well worth doing.


> If only 1% of cpu time was spent on slow square root, and this sped it up by 100%, it would be barely worth it.

If you can do 10,000 things with 1% of CPU time, and you are now able to do 5,000 more things, is it still not worth it?


In isolation, no. You could instead spend your time optimizing something else that's currently slower and speed that up instead.


I've always wondered too if it would be possible to log all the values it's actually calculating the inverse square root for, and just save them in a lookup table. Or even just the most common x% of them, if you could reliably know which numbers it's most commonly running the calculation on.


Even if you pretend it's only a few values (which it's not; there are a lot of floats), this would be dramatically slower due to the cache misses involved.


It clearly depends on the workload. If you have tons of 3D particles or polygons on screen, one of the most repeated operations is the calculation of norms (x / sqrt(sum(x^2))). John Carmack came to the fast inverse sqrt explicitly for this speed advantage.


It was first found in Quake 3, which I believe used software rendering. Given that the inverse square root is heavily used for lighting functions, and it was all done on the CPU, it's reasonable to believe that this provided a non-negligible speedup overall.


IIRC Quake 3 required a 3D accelerator and didn't support software rendering - but this was before graphics cards supported hardware T&L, so you're right that those lighting calcs would be performed on the CPU.


One of my favourite hacks of all time (in part due to the comments)


God, this is such an old meme at this point. What's next, telling me how good lisp is?


It's a classic for sure, but it's still cool seeing a new take on it.


Obligatory XKCD: https://xkcd.com/1053/


Not having heard of something is always understandable, and it's pretty immature to make fun or "feign surprise" when someone doesn't know something you know.

But not searching to see if something had been posted before on a forum is a bit different[0]

[0] https://hn.algolia.com/?q=inverse+square+root


> If a story has not had significant attention in the last year or so, a small number of reposts is ok. Otherwise we bury reposts as duplicates.

https://news.ycombinator.com/newsfaq.html


Interestingly, there is also an xkcd comic that references this topic in the alt text: https://xkcd.com/664/


XKCD is legit shit. This one, the standards one, the Bobby Tables one. And the password one. Things that are barely worth a chuckle are repeated as gospel.


Wow, bruh, who hurt you?


xkcd is as generic as they get. Randall goes idea -> generate numbers -> generate graph -> punchline, and people milk it for as long as possible. It's like geek-themed Garfield. I wouldn't be surprised if he tries xkcd without stick people soon.


So it's generic and boring. But why are you really angry about it?


Why do you attribute anger to every criticism? This is getting weird.


Because it's everywhere.




