There are multiple examples of the algorithm in practice. For example:
Newton Raphson with SSE2 - can someone explain me these 3 lines has an answer explaining the iteration used by one of Intel's examples.
For perf analysis on, say, Haswell (which can FP mul on two execution ports, unlike earlier designs), I'll reproduce the code here (with one op per line). See http://agner.org/optimize/ for instruction throughput and latency tables, and for docs on how to understand more of the background.
```c++
// execute (aka dispatch) on cycle 1, results ready on cycle 6
nr = _mm_rsqrt_ps( x );

// both of these execute on cycle 6, results ready on cycle 11
xnr     = _mm_mul_ps( x, nr );     // dep on nr
half_nr = _mm_mul_ps( half, nr );  // dep on nr

// can execute on cycle 11, result ready on cycle 16
muls = _mm_mul_ps( xnr, nr );      // dep on xnr

// can execute on cycle 16, result ready on cycle 19
three_minus_muls = _mm_sub_ps( three, muls );  // dep on muls

// can execute on cycle 19, result ready on cycle 24
result = _mm_mul_ps( half_nr, three_minus_muls );  // dep on three_minus_muls

// result is an approximation of 1/sqrt(x), with ~22 to 23 bits of precision in the mantissa.
```
This leaves plenty of room for other computation to overlap, as long as it's not part of the dependency chain. However, if the data for each iteration of your code does depend on data from the previous one, you might be better off with the 11-cycle latency of sqrtps. Likewise if each loop iteration is long enough that out-of-order execution can't hide it all by overlapping independent iterations.
To get sqrt(x) instead of 1/sqrt(x), multiply by x (and fix up the case where x can be zero, e.g. by masking (_mm_andnot_ps) with the result of a CMPPS against 0.0). The optimal way is to replace half_nr with half_xnr = _mm_mul_ps( half, xnr );. That doesn't lengthen the dependency chain at all, because half_xnr can start on cycle 11 but isn't needed until the end (cycle 19). Same with FMA available: no increase in total latency.
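Putting that together, here's a minimal sketch of the sqrt(x) variant (the function name and the zero-fixup details are mine, not from Intel's example):

```c++
#include <immintrin.h>

// One Newton-Raphson iteration on rsqrtps, multiplied through by x to get
// ~sqrt(x), with the x == 0.0 case masked off (rsqrt(0) is +Inf, so the
// uncorrected result would be NaN in those lanes).
__m128 sqrt_nr(__m128 x) {
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);

    __m128 nr       = _mm_rsqrt_ps(x);          // ~1/sqrt(x), ~12-bit precision
    __m128 xnr      = _mm_mul_ps(x, nr);
    __m128 half_xnr = _mm_mul_ps(half, xnr);    // off the critical path
    __m128 muls     = _mm_mul_ps(xnr, nr);
    __m128 result   = _mm_mul_ps(half_xnr, _mm_sub_ps(three, muls));  // ~sqrt(x)

    __m128 iszero = _mm_cmpeq_ps(x, _mm_setzero_ps());
    return _mm_andnot_ps(iszero, result);       // force 0.0 where x == 0.0
}
```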
If 11 bits of precision are enough (no Newton iteration), Intel's Optimization Guide (section 11.12.3) suggests using rcpps(rsqrt(x)), which is worse than multiplying by the original x, at least with AVX. It maybe saves a movdqa instruction with 128-bit SSE, but 256b rcpps is slower than 256b mul or fma. (And multiplying by x lets you add the sqrt result to something for free, using FMA instead of mul for the last step.)
The SSE version of this loop, not counting any move instructions, is 6 uops. That means it should have a throughput on Haswell of one per 3 cycles (given that two execution ports can handle FP mul, and rsqrt is on the opposite port from FP add/sub). On SnB/IvB (and probably Nehalem), it should have a throughput of one per 5 cycles, since mulps and rsqrtps compete for port 0. subps is on port 1, which isn't the bottleneck.
For Haswell, we can use FMA to combine the subtract with the mul. However, the divide/sqrt unit isn't 256b wide, so unlike everything else, divps / sqrtps / rsqrtps / rcpps on ymm regs take extra uops and extra cycles.
```c++
// vrsqrtps ymm has higher latency
// execute on cycle 1, results ready on cycle 8
nr = _mm256_rsqrt_ps( x );

// both of these can execute on cycle 8, results ready on cycle 13
xnr     = _mm256_mul_ps( x, nr );     // dep on nr
half_nr = _mm256_mul_ps( half, nr );  // dep on nr

// can execute on cycle 13, result ready on cycle 18
three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three );  // -(xnr*nr) + 3

// can execute on cycle 18, result ready on cycle 23
result = _mm256_mul_ps( half_nr, three_minus_muls );    // dep on three_minus_muls
```
We save 3 cycles with FMA. That's offset by using the 2-cycle-slower 256b rsqrt, for a net gain of 1 cycle less latency (pretty good for twice as wide). SnB/IvB AVX: 24c latency for the 128b version, 26c for 256b.
The 256b FMA version uses 7 uops total. (VRSQRTPS is 3 uops: 2 for p0, and one for p1/5.) 256b mulps and fma are single-uop instructions, and both can run on port 0 or port 1 (p0 only on pre-Haswell). So throughput should be one per 3c, if the out-of-order engine dispatches uops to the optimal execution ports (i.e. the shuffle uop from rsqrt always goes to p5, not to p1 where it would take up mul/fma throughput). As for overlapping with other computation (not just independent execution of itself): again this is pretty lightweight. Only 7 uops with a 23-cycle dependency chain leaves lots of room for other things to happen while those uops sit in the re-order buffer.
If this is one step in a giant dependency chain with nothing else going on (not even an independent next iteration), then 256b vsqrtps is 19 cycles latency, with a throughput of one per 14 cycles (Haswell). If you really do still need the reciprocal, 256b vdivps also has 18-21c latency, with one per 14c throughput. So for plain sqrt, the instruction has lower latency. For reciprocal sqrt, an iteration of the approximation is lower latency. (And far more throughput, if it's overlapping with itself. If it's overlapping with other work that doesn't use the divide unit, sqrtps isn't a problem.)
sqrtps can be a throughput win vs. rsqrt + a Newton iteration, if it's part of a loop body with enough other non-divide and non-sqrt work going on that the divide unit isn't saturated.
This is especially true if you need sqrt(x) rather than 1/sqrt(x). E.g. on Haswell with AVX2, a copy+arcsinh loop over an array of floats that fits in L3 cache, implemented as fastlog(v + sqrt(v*v + 1)), runs at about the same throughput with real VSQRTPS as with VRSQRTPS + a Newton-Raphson iteration. (Even with a very fast approximation for log(), so the total loop body is about 9 FMA/add/mul/convert operations and 2 booleans, plus the VSQRTPS ymm.) There is a speedup from using fastlog(v2_plus_1 * rsqrt(v2_plus_1) + v), so it isn't bottlenecked on memory bandwidth, but it might be bottlenecked on latency (so out-of-order execution can't exploit all the parallelism of independent iterations).
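As a rough sketch, that rsqrt-based arcsinh loop body might look like the following (fastlog() stands in for whatever polynomial log() approximation is in use; it and the variable names are illustrative, not the actual benchmarked code):

```c++
// arcsinh(v) = log(v + sqrt(v*v + 1)), using the identity sqrt(u) = u * rsqrt(u)
// so a refined VRSQRTPS replaces VSQRTPS. Requires AVX2 + FMA.
__m256 arcsinh_body(__m256 v) {
    const __m256 one   = _mm256_set1_ps(1.0f);
    const __m256 half  = _mm256_set1_ps(0.5f);
    const __m256 three = _mm256_set1_ps(3.0f);

    __m256 u        = _mm256_fmadd_ps(v, v, one);   // u = v*v + 1 (>= 1, so no zero fixup)
    __m256 nr       = _mm256_rsqrt_ps(u);
    __m256 unr      = _mm256_mul_ps(u, nr);
    __m256 half_unr = _mm256_mul_ps(half, unr);
    __m256 sqrt_u   = _mm256_mul_ps(half_unr, _mm256_fnmadd_ps(unr, nr, three)); // ~sqrt(u)
    return fastlog(_mm256_add_ps(v, sqrt_u));       // log(v + sqrt(v*v + 1))
}
```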
Other comments
For completeness: there are no instructions that compute directly on half floats. You're meant to convert on the fly as you load/store, using the conversion instructions.
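For example, with F16C (IvB and later) a load/compute/store round trip could look like this sketch (the function and the toy computation are mine):

```c++
#include <immintrin.h>
#include <stdint.h>

// Load 8 half floats, widen to single precision, compute, then narrow and store.
void scale_halves(uint16_t *p) {
    __m128i h = _mm_loadu_si128((const __m128i*)p);
    __m256  f = _mm256_cvtph_ps(h);                      // 8x half -> 8x float
    f = _mm256_mul_ps(f, _mm256_set1_ps(2.0f));          // do the math as floats
    __m128i h2 = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT);  // back to half
    _mm_storeu_si128((__m128i*)p, h2);
}
```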
There is no rsqrtpd for double precision. Presumably the thinking is that if you don't need full precision, you should be using float in the first place. So you could convert to float and back, then do the exact same algorithm, but with ps instead of pd intrinsics. Or you could keep your data as float for a while; e.g. convert two ymm registers of doubles into one ymm register of singles.
Unfortunately there's no single instruction that takes two ymm registers of doubles and outputs a ymm of singles. You have to go ymm->xmm twice, then _mm256_insertf128_ps one xmm into the high 128 of the other. But then you can run that 256b ymm vector through the same algorithm.
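A sketch of that packing (and the reverse), with helper names of my own choosing:

```c++
#include <immintrin.h>

// Narrow two __m256d (4 doubles each) into one __m256 (8 floats).
// _mm256_cvtpd_ps produces an xmm result, so combine the halves with vinsertf128.
static inline __m256 pack_2pd_to_ps(__m256d lo, __m256d hi) {
    __m128 lo_ps = _mm256_cvtpd_ps(lo);
    __m128 hi_ps = _mm256_cvtpd_ps(hi);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo_ps), hi_ps, 1);
}

// Widen back: each 128b half becomes 4 doubles.
static inline void unpack_ps_to_2pd(__m256 v, __m256d *lo, __m256d *hi) {
    *lo = _mm256_cvtps_pd(_mm256_castps256_ps128(v));
    *hi = _mm256_cvtps_pd(_mm256_extractf128_ps(v, 1));
}
```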
If you're going to convert back to double right away, it may make sense to do the rsqrt + Newton-Raphson iteration on the two 128b registers of singles separately. The extra 2 uops for insert/extract, and the extra 2 uops for 256b rsqrt, start to add up, not to mention the 3-cycle latency of vinsertf128 / vextractf128. The computation will overlap, and both results will be ready a couple of cycles apart. (Or 1 cycle apart, if you have a special version of your code for interleaving operations on 2 inputs at once.)
Remember that single precision has a smaller exponent range than double, so the conversion can overflow to +Inf or underflow to 0.0. If you're not sure, definitely just use the regular _mm_sqrt_pd.
With AVX512F, there is _mm512_rsqrt14_pd( __m512d a). With AVX512ER (KNL, but not SKX or Cannonlake), there is _mm512_rsqrt28_pd. _ps versions exist as well. 14 bits of mantissa precision may be enough to use without a Newton iteration in some cases.
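If you do want more precision for doubles, one Newton iteration on top of the 14-bit approximation might look like this sketch (assuming AVX512F; it roughly doubles the precision to ~28 bits, still well short of full double):

```c++
#include <immintrin.h>

// _mm512_rsqrt14_pd refined with one Newton-Raphson step: nr * (3 - x*nr^2) / 2.
__m512d rsqrt_nr_pd(__m512d x) {
    const __m512d half  = _mm512_set1_pd(0.5);
    const __m512d three = _mm512_set1_pd(3.0);
    __m512d nr      = _mm512_rsqrt14_pd(x);            // ~14-bit 1/sqrt(x)
    __m512d xnr     = _mm512_mul_pd(x, nr);
    __m512d half_nr = _mm512_mul_pd(half, nr);
    return _mm512_mul_pd(half_nr, _mm512_fnmadd_pd(xnr, nr, three));
}
```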
A Newton iteration still won't give you a correctly rounded single-precision float, the way regular sqrt does.