It may be useful to mention that modern architectures often have vectorized instructions like vrsqrt14ps (accessible via the _mm512_rsqrt14_ps intrinsic) that provide a 14-bit approximation (there are more accurate variants too) in every lane, with a reciprocal throughput of 2 cycles. These are faster than the integer bit hacks.
AMD Vega has V_RSQ_F32 (reciprocal square root), NVIDIA has rsqrt.approx.f32, and ARM NEON has vrsqrteq_f32.
So all the major platforms, CPUs and GPUs alike, implement a fast reciprocal square root to a decent amount of accuracy, with no need for bit-twiddling anymore.
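For the Intel case, a minimal sketch of what using the estimate looks like with AVX-512 intrinsics (assuming AVX-512F is available; the wrapper function name is just for illustration):

    #include <immintrin.h>

    // Sketch: approximate 1/sqrt(x) for 16 floats at once using the
    // AVX-512 estimate instruction (vrsqrt14ps), ~14 bits of precision.
    void rsqrt_estimate(const float *in, float *out) {
        __m512 x = _mm512_loadu_ps(in);   // load 16 unaligned floats
        __m512 y = _mm512_rsqrt14_ps(x);  // per-lane reciprocal sqrt estimate
        _mm512_storeu_ps(out, y);         // store 16 results
    }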
You can design an instruction that performs X Newton iterations in 1 cycle if you want, and pick X to give you 14-bit precision.
With such an instruction, you could increase the precision further in just 2 cycles by issuing the same instruction twice, since each Newton iteration roughly doubles the number of correct bits.
You can't do that on Intel's hardware. As you mention, you'd need to roll your own Newton iteration out of multiple SIMD instructions and run it after the initial rsqrt call.
That's really sad. There is hardware on Intel CPUs that performs Newton iterations (that's how the instruction is implemented), but the ISA only exposes it through the "do a 14-bit rsqrt" operation, which means you can't really use it to increase precision if that doesn't suffice for your application.
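For reference, the hand-rolled refinement is the usual Newton-Raphson step y' = y * (1.5 - 0.5 * x * y * y). A sketch of what that costs in extra SIMD instructions (helper name made up for illustration):

    #include <immintrin.h>

    // Sketch: refine the ~14-bit estimate from vrsqrt14ps with one
    // Newton-Raphson step, roughly doubling the number of correct bits.
    // y' = y * (1.5 - 0.5 * x * y * y)
    static inline __m512 rsqrt_newton(__m512 x) {
        const __m512 half         = _mm512_set1_ps(0.5f);
        const __m512 three_halves = _mm512_set1_ps(1.5f);

        __m512 y  = _mm512_rsqrt14_ps(x);        // ~14-bit estimate
        __m512 yy = _mm512_mul_ps(y, y);         // y*y
        // t = 1.5 - (0.5*x) * (y*y), via fused negate-multiply-add
        __m512 t  = _mm512_fnmadd_ps(_mm512_mul_ps(half, x), yy, three_halves);
        return _mm512_mul_ps(y, t);              // refined estimate
    }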
https://software.intel.com/sites/landingpage/IntrinsicsGuide...