There are multiple examples of the algorithm in practice. For example:
Newton Raphson with SSE2 - can someone explain me these 3 lines has an answer explaining the iteration used by one of Intel's examples.
For perf analysis on, say, Haswell (which can FP mul on two execution ports, unlike earlier designs), I'll reproduce the code here (with one op per line). See http://agner.org/optimize/ for instruction throughput and latency tables, and for docs on how to understand more of the background.
```c++
// execute (aka dispatch) on cycle 1, results ready on cycle 6
nr = _mm_rsqrt_ps( x );

// both of these execute on cycle 6, results ready on cycle 11
xnr     = _mm_mul_ps( x, nr );     // dep on nr
half_nr = _mm_mul_ps( half, nr );  // dep on nr

// can execute on cycle 11, result ready on cycle 16
muls = _mm_mul_ps( xnr, nr );      // dep on xnr

// can execute on cycle 16, result ready on cycle 19
three_minus_muls = _mm_sub_ps( three, muls );  // dep on muls

// can execute on cycle 19, result ready on cycle 24
result = _mm_mul_ps( half_nr, three_minus_muls );  // dep on three_minus_muls

// result is an approximation of 1/sqrt(x), with ~22 to 23 bits of precision in the mantissa.
```
This leaves plenty of room for other computation to overlap, as long as it's not part of the dependency chain. However, if the data for each iteration of your code does depend on data from the previous one, you might be better off with the 11-cycle latency of sqrtps. Likewise if each loop iteration is long enough that out-of-order execution can't hide it all by overlapping independent iterations.
To get sqrt(x) instead of 1/sqrt(x), multiply by x (and fix up the case where x can be zero, e.g. by masking (_mm_andnot_ps) with the result of a CMPPS against 0.0). The optimal way is to replace half_nr with half_xnr = _mm_mul_ps( half, xnr );. That doesn't lengthen the dependency chain at all, because half_xnr can start on cycle 11 but isn't needed until the end (cycle 19). Same with FMA available: no increase in total latency.
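Putting that together, here's a minimal sketch of the sqrt(x) variant (the function name and the zero-fixup details are mine, not from Intel's example):

```c++
#include <immintrin.h>

// One Newton-Raphson iteration on rsqrtps, multiplied through by x to get
// ~sqrt(x), with the x == 0.0 case masked off (rsqrt(0) is +Inf, so the
// uncorrected result would be NaN in those lanes).
__m128 sqrt_nr(__m128 x) {
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);

    __m128 nr       = _mm_rsqrt_ps(x);          // ~1/sqrt(x), ~12-bit precision
    __m128 xnr      = _mm_mul_ps(x, nr);
    __m128 half_xnr = _mm_mul_ps(half, xnr);    // off the critical path
    __m128 muls     = _mm_mul_ps(xnr, nr);
    __m128 result   = _mm_mul_ps(half_xnr, _mm_sub_ps(three, muls));  // ~sqrt(x)

    __m128 iszero = _mm_cmpeq_ps(x, _mm_setzero_ps());
    return _mm_andnot_ps(iszero, result);       // force 0.0 where x == 0.0
}
```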
If 11 bits of precision are enough (no Newton iteration), Intel's Optimization Guide (section 11.12.3) suggests using rcpps(rsqrt(x)), which is worse than multiplying by the original x, at least with AVX. It maybe saves a movdqa instruction with 128-bit SSE, but 256b rcpps is slower than 256b mul or fma. (And multiplying by x lets you add the sqrt result to something for free, using FMA instead of mul for the last step.)
The SSE version of this loop, not counting any move instructions, is 6 uops. That means it should have a throughput on Haswell of one per 3 cycles (given that two execution ports can handle FP mul, and rsqrt is on the opposite port from FP add/sub). On SnB/IvB (and probably Nehalem), it should have a throughput of one per 5 cycles, since mulps and rsqrtps compete for port 0. subps is on port 1, which isn't the bottleneck.
For Haswell, we can use FMA to combine the subtract with the mul. However, the divide/sqrt unit isn't 256b wide, so unlike everything else, divps / sqrtps / rsqrtps / rcpps on ymm regs take extra uops and extra cycles.
```c++
// vrsqrtps ymm has higher latency
// execute on cycle 1, results ready on cycle 8
nr = _mm256_rsqrt_ps( x );

// both of these can execute on cycle 8, results ready on cycle 13
xnr     = _mm256_mul_ps( x, nr );     // dep on nr
half_nr = _mm256_mul_ps( half, nr );  // dep on nr

// can execute on cycle 13, result ready on cycle 18
three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three );  // -(xnr*nr) + 3

// can execute on cycle 18, result ready on cycle 23
result = _mm256_mul_ps( half_nr, three_minus_muls );    // dep on three_minus_muls
```
We save 3 cycles with FMA. That's offset by using the 2-cycle-slower 256b rsqrt, for a net gain of 1 cycle less latency (pretty good for twice as wide). SnB/IvB AVX: 24c latency for the 128b version, 26c for 256b.
The 256b FMA version uses 7 uops total. (VRSQRTPS is 3 uops: 2 for p0, and one for p1/5.) 256b mulps and fma are single-uop instructions, and both can run on port 0 or port 1 (p0 only on pre-Haswell). So throughput should be one per 3c, if the out-of-order engine dispatches uops to the optimal execution ports (i.e. the shuffle uop from rsqrt always goes to p5, not to p1 where it would take up mul/fma throughput). As for overlapping with other computation (not just independent execution of itself): again this is pretty lightweight. Only 7 uops with a 23-cycle dependency chain leaves lots of room for other things to happen while those uops sit in the re-order buffer.
If this is one step in a giant dependency chain with nothing else going on (not even an independent next iteration), then 256b vsqrtps is 19 cycles latency, with a throughput of one per 14 cycles (Haswell). If you really do still need the reciprocal, 256b vdivps also has 18-21c latency, with one per 14c throughput. So for plain sqrt, the instruction has lower latency. For reciprocal sqrt, an iteration of the approximation is lower latency. (And far more throughput, if it's overlapping with itself. If it's overlapping with other work that doesn't use the divide unit, sqrtps isn't a problem.)
sqrtps can be a throughput win vs. rsqrt + a Newton iteration, if it's part of a loop body with enough other non-divide and non-sqrt work going on that the divide unit isn't saturated.
This is especially true if you need sqrt(x) rather than 1/sqrt(x). E.g. on Haswell with AVX2, a copy+arcsinh loop over an array of floats that fits in L3 cache, implemented as fastlog(v + sqrt(v*v + 1)), runs at about the same throughput with real VSQRTPS as with VRSQRTPS + a Newton-Raphson iteration. (Even with a very fast approximation for log(), so the total loop body is about 9 FMA/add/mul/convert operations and 2 booleans, plus the VSQRTPS ymm.) There is a speedup from using fastlog(v2_plus_1 * rsqrt(v2_plus_1) + v), so it isn't bottlenecked on memory bandwidth, but it might be bottlenecked on latency (so out-of-order execution can't exploit all the parallelism of independent iterations).
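As a rough sketch, that rsqrt-based arcsinh loop body might look like the following (fastlog() stands in for whatever polynomial log() approximation is in use; it and the variable names are illustrative, not the actual benchmarked code):

```c++
// arcsinh(v) = log(v + sqrt(v*v + 1)), using the identity sqrt(u) = u * rsqrt(u)
// so a refined VRSQRTPS replaces VSQRTPS. Requires AVX2 + FMA.
__m256 arcsinh_body(__m256 v) {
    const __m256 one   = _mm256_set1_ps(1.0f);
    const __m256 half  = _mm256_set1_ps(0.5f);
    const __m256 three = _mm256_set1_ps(3.0f);

    __m256 u        = _mm256_fmadd_ps(v, v, one);   // u = v*v + 1 (>= 1, so no zero fixup)
    __m256 nr       = _mm256_rsqrt_ps(u);
    __m256 unr      = _mm256_mul_ps(u, nr);
    __m256 half_unr = _mm256_mul_ps(half, unr);
    __m256 sqrt_u   = _mm256_mul_ps(half_unr, _mm256_fnmadd_ps(unr, nr, three)); // ~sqrt(u)
    return fastlog(_mm256_add_ps(v, sqrt_u));       // log(v + sqrt(v*v + 1))
}
```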
Other comments
For completeness: there are no instructions that compute directly on half floats. You're meant to convert on the fly as you load/store, using the conversion instructions.
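For example, with F16C (IvB and later) a load/compute/store round trip could look like this sketch (the function and the toy computation are mine):

```c++
#include <immintrin.h>
#include <stdint.h>

// Load 8 half floats, widen to single precision, compute, then narrow and store.
void scale_halves(uint16_t *p) {
    __m128i h = _mm_loadu_si128((const __m128i*)p);
    __m256  f = _mm256_cvtph_ps(h);                      // 8x half -> 8x float
    f = _mm256_mul_ps(f, _mm256_set1_ps(2.0f));          // do the math as floats
    __m128i h2 = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT);  // back to half
    _mm_storeu_si128((__m128i*)p, h2);
}
```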
There is no rsqrtpd for double precision. Presumably the thinking is that if you don't need full precision, you should be using float in the first place. So you could convert to float and back, then do the exact same algorithm, but with ps instead of pd intrinsics. Or you could keep your data as float for a while; e.g. convert two ymm registers of doubles into one ymm register of singles.
Unfortunately there's no single instruction that takes two ymm registers of doubles and outputs a ymm of singles. You have to go ymm->xmm twice, then _mm256_insertf128_ps one xmm into the high 128 of the other. But then you can run that 256b ymm vector through the same algorithm.
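A sketch of that packing (and the reverse), with helper names of my own choosing:

```c++
#include <immintrin.h>

// Narrow two __m256d (4 doubles each) into one __m256 (8 floats).
// _mm256_cvtpd_ps produces an xmm result, so combine the halves with vinsertf128.
static inline __m256 pack_2pd_to_ps(__m256d lo, __m256d hi) {
    __m128 lo_ps = _mm256_cvtpd_ps(lo);
    __m128 hi_ps = _mm256_cvtpd_ps(hi);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo_ps), hi_ps, 1);
}

// Widen back: each 128b half becomes 4 doubles.
static inline void unpack_ps_to_2pd(__m256 v, __m256d *lo, __m256d *hi) {
    *lo = _mm256_cvtps_pd(_mm256_castps256_ps128(v));
    *hi = _mm256_cvtps_pd(_mm256_extractf128_ps(v, 1));
}
```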
If you're going to convert back to double right away, it may make sense to do the rsqrt + Newton-Raphson iteration on the two 128b registers of singles separately. The extra 2 uops for insert/extract, and the extra 2 uops for 256b rsqrt, start to add up, not to mention the 3-cycle latency of vinsertf128 / vextractf128. The computation will overlap, and both results will be ready a couple of cycles apart. (Or 1 cycle apart, if you have a special version of your code for interleaving operations on 2 inputs at once.)
Remember that single precision has a smaller exponent range than double, so the conversion can overflow to +Inf or underflow to 0.0. If you're not sure, definitely just use the regular _mm_sqrt_pd.
With AVX512F, there is _mm512_rsqrt14_pd( __m512d a). With AVX512ER (KNL, but not SKX or Cannonlake), there is _mm512_rsqrt28_pd. _ps versions exist as well. 14 bits of mantissa precision may be enough to use without a Newton iteration in some cases.
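If you do want more precision for doubles, one Newton iteration on top of the 14-bit approximation might look like this sketch (assuming AVX512F; it roughly doubles the precision to ~28 bits, still well short of full double):

```c++
#include <immintrin.h>

// _mm512_rsqrt14_pd refined with one Newton-Raphson step: nr * (3 - x*nr^2) / 2.
__m512d rsqrt_nr_pd(__m512d x) {
    const __m512d half  = _mm512_set1_pd(0.5);
    const __m512d three = _mm512_set1_pd(3.0);
    __m512d nr      = _mm512_rsqrt14_pd(x);            // ~14-bit 1/sqrt(x)
    __m512d xnr     = _mm512_mul_pd(x, nr);
    __m512d half_nr = _mm512_mul_pd(half, nr);
    return _mm512_mul_pd(half_nr, _mm512_fnmadd_pd(xnr, nr, three));
}
```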
A Newton iteration still won't give you a correctly rounded single-precision float, the way regular sqrt does.