It may be useful to mention that modern architectures often have vectorized instructions like vrsqrt14ps (accessible via the _mm512_rsqrt14_ps intrinsic) that provide a 14-bit approximation (there are more accurate variants too) in every lane, with a reciprocal throughput of 2 cycles. These are faster than the integer bit hacks.
AMD Vega has V_RSQ_F32 (reciprocal square root), NVIDIA has rsqrt.approx.f32, and ARM NEON has vrsqrteq_f32.
So all the major platforms, CPUs and GPUs alike, implement a fast reciprocal square root to a decent amount of accuracy, with no need for bit-twiddling anymore.
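For the Intel case, a minimal sketch of what using the estimate looks like with AVX-512 intrinsics (assuming AVX-512F is available; the wrapper function name is just for illustration):

    #include <immintrin.h>

    // Sketch: approximate 1/sqrt(x) for 16 floats at once using the
    // AVX-512 estimate instruction (vrsqrt14ps), ~14 bits of precision.
    void rsqrt_estimate(const float *in, float *out) {
        __m512 x = _mm512_loadu_ps(in);   // load 16 unaligned floats
        __m512 y = _mm512_rsqrt14_ps(x);  // per-lane reciprocal sqrt estimate
        _mm512_storeu_ps(out, y);         // store 16 results
    }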
You can design an instruction that performs X Newton iterations in 1 cycle if you want, and pick X to give you 14-bit precision.
With such an instruction, you could increase the precision further in just 2 cycles by issuing the same instruction twice, since each Newton iteration roughly doubles the number of correct bits.
You can't do that on Intel's hardware. As you mention, you'd need to roll your own Newton iteration out of multiple SIMD instructions and run it after the initial rsqrt call.
That's really sad. There is hardware on Intel CPUs that performs Newton iterations (that's how the instruction is implemented), but the ISA only exposes it through the "do a 14-bit rsqrt" operation, which means you can't really use it to increase precision if that doesn't suffice for your application.
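For reference, the hand-rolled refinement is the usual Newton-Raphson step y' = y * (1.5 - 0.5 * x * y * y). A sketch of what that costs in extra SIMD instructions (helper name made up for illustration):

    #include <immintrin.h>

    // Sketch: refine the ~14-bit estimate from vrsqrt14ps with one
    // Newton-Raphson step, roughly doubling the number of correct bits.
    // y' = y * (1.5 - 0.5 * x * y * y)
    static inline __m512 rsqrt_newton(__m512 x) {
        const __m512 half         = _mm512_set1_ps(0.5f);
        const __m512 three_halves = _mm512_set1_ps(1.5f);

        __m512 y  = _mm512_rsqrt14_ps(x);        // ~14-bit estimate
        __m512 yy = _mm512_mul_ps(y, y);         // y*y
        // t = 1.5 - (0.5*x) * (y*y), via fused negate-multiply-add
        __m512 t  = _mm512_fnmadd_ps(_mm512_mul_ps(half, x), yy, three_halves);
        return _mm512_mul_ps(y, t);              // refined estimate
    }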
https://software.intel.com/sites/landingpage/IntrinsicsGuide...