Handling zeros in _mm256_rsqrt_ps()

Given that _mm256_sqrt_ps() is relatively slow, and that the values I generate are immediately truncated with _mm256_floor_ps(), looking around it seems that:

 _mm256_mul_ps(_mm256_rsqrt_ps(eightFloats), eightFloats); 

is the way to go for that extra bit of performance and to avoid a pipeline stall.

Unfortunately, with zero values I of course get a failed calculation from 1/sqrt(0). What is the best way to work around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions?

 _mm256_mul_ps(_mm256_rsqrt_ps(_mm256_max_ps(eightFloats, _mm256_set1_ps(0.1))), eightFloats); 

My code is for a vertical application, so I can assume it will be running on a Haswell CPU (i7-4810MQ), so FMA / AVX2 can be used. The source code is approximately:

    float vals[MAX];
    int sum = 0;

    for (int i = 0; i < MAX; i++) {
        int thisSqrt = (int) floor(sqrt(vals[i]));
        sum += min(thisSqrt, 0x3F);
    }

All the values of vals should be integer values. (Why everything isn't just int is another question...)

c x86 sse avx intrinsics
Mar 09 '16 at 7:09
1 answer

tl;dr: see the end for code that compiles and should work.




To deal with the 0.0 problem, you can also special-case 0.0 inputs with an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x). Clang does this when autovectorizing. (But it uses blendps with a zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)
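
Here is a minimal sketch of that compare-and-mask idea (untested; the function name is just for illustration):

    #include <immintrin.h>

    // Zero out lanes where the input was exactly 0.0, so the 0 * +Inf NaN never survives.
    __m256 sqrt_approx_zerosafe(__m256 x) {
        __m256 nr      = _mm256_rsqrt_ps(x);                  // ~1/sqrt(x); +Inf for x == 0
        __m256 sqrt_x  = _mm256_mul_ps(x, nr);                // x * rsqrt(x) ~= sqrt(x); NaN for x == 0
        __m256 is_zero = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_EQ_OQ);
        return _mm256_andnot_ps(is_zero, sqrt_x);             // mask is all-ones where x == 0: force those lanes to 0.0
    }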

You could also use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa. rsqrt(0) = +Inf, and recip(+Inf) = 0. However, using two approximations compounds the relative error, which is a problem.
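
As a sketch (untested; name made up), that variant is just two approximation instructions back to back:

    #include <immintrin.h>

    // sqrt(x) ~= recip(rsqrt(x)); zeros come out as 0 because rcp(+Inf) == 0,
    // but the relative errors of the two approximations compound.
    __m256 sqrt_approx_rcp_rsqrt(__m256 x) {
        return _mm256_rcp_ps(_mm256_rsqrt_ps(x));
    }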




What you are missing:

Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25 * rsqrt(25) is 4.999999 or something (instead of 5.00001), you will add 4 instead of 5.

Even with a Newton-Raphson iteration, rsqrtps is not perfectly accurate the way sqrtps is, so it can still give 5.0 - 1ulp. (1ulp = one unit in the last place = lowest bit of the mantissa).

Also related:

  • an explanation of the Newton-Raphson formula
  • Newton-Raphson SSE implementation efficiency (latency / throughput). Note that we care more about throughput than latency, since we use it in a loop that does little else. sqrt is not part of the loop-carried dependency chain, so different iterations can have their sqrt calculations in flight at once. (A scalar sketch of the refinement step follows this list.)
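
For reference, the refinement step those links discuss, as a minimal scalar sketch (untested; names are illustrative):

    // One Newton-Raphson step for rsqrt: given an initial estimate nr ~= 1/sqrt(x),
    // return a refined estimate with roughly twice as many correct bits.
    float rsqrt_nr_step(float x, float nr) {
        return nr * (1.5f - 0.5f * x * nr * nr);
    }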



It might be possible to kill two birds with one stone by adding a small constant before doing (x+offset)*approx_rsqrt(x+offset), and then truncating to integer: large enough to overcome the maximum relative error of 1.5 * 2^-12, but small enough not to bump sqrt_approx(63*63-1+offset) up to 63 (the most sensitive case).

    63*1.5*2^(-12)       ==  0.023071...
    approx_sqrt(63*63-1) == 62.99206... +/- 0.023068..

Actually, we're out of luck without a Newton iteration, even without adding anything: approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where sqrt(n*n-1) plus the maximum relative error is still less than sqrt(n*n). GNU Calc:

    define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
    ; f(36)
    35.98610843089316319413
    ~0.01317850650545403926 ~35.99928693739861723339
    ; f(37)
    36.9864840178138587015
    ~0.01354485498699237990 ~37.00002887280085108140

Does your source data have any properties that mean you don't have to worry about it being just below a large perfect square? E.g. is it always perfect squares?

You could check all possible input values, since the important domain is very small (integer FP values from 0..63*63), to see whether the error in practice is small enough on Intel Haswell, but that would be a brittle optimization that could make your code break on AMD CPUs, or even on future Intel CPUs. Unfortunately, coding only to the ISA spec's guarantee that the relative error is up to 1.5 * 2^-12 requires extra instructions. I don't see any tricks that would avoid a NR iteration.

If your upper limit was smaller (e.g. 20), you could just do isqrt = static_cast<int>((x+0.5)*approx_rsqrt(x+0.5)). You would get 20 for 20*20, but always 19 for 20*20-1.

    ; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
    ; test_approx_sqrt(20, 0.5)
    ~20.01249609618950056874 ~19.98749609130668473087   # (x+0.5)/sqrt(x+0.5)
     ~0.00732879495710064718  ~0.00731963968187500662   # relative error

Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should hold for relative FP errors.
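
A vectorized sketch of that small-upper-limit variant (untested; function name made up), which skips the Newton iteration entirely:

    #include <immintrin.h>

    // Only accurate enough when inputs stay well below the sensitive cases around 63*63;
    // for this question's clamp at 0x3F it is NOT sufficient (see the analysis above).
    __m256i isqrt256_small_inputs(__m256 x) {
        __m256 xoff = _mm256_add_ps(x, _mm256_set1_ps(0.5f));      // also avoids 0 * +Inf = NaN
        __m256 s    = _mm256_mul_ps(xoff, _mm256_rsqrt_ps(xoff));  // ~sqrt(x + 0.5), no NR step
        return _mm256_cvttps_epi32(s);                             // truncate toward zero
    }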




Anyway, I think you need one Newton-Raphson iteration.

The best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem and pushes the +/- error range entirely to one side of the whole-number cut point (for numbers in the range we care about).

FP min / max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result.




Compiler Output: A Worthy Starting Point

I get pretty good results for sum_int from clang 3.7.1 with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).

sum_fp does not auto-vectorize, even with the full -ffast-math.

However, clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, potentially giving the wrong integer. IDK if this is why gcc doesn't auto-vectorize, because it could have used sqrtps for a big speedup without changing the results. (At least as long as all the floats are between 0 and INT_MAX^2, otherwise converting back to integer gives the "indefinite" result of INT_MIN (sign bit set, all other bits cleared). This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something.)

See the asm output on godbolt from:

    // autovectorizes with clang, but has rounding problems.
    // Note the use of sqrtf, and that floorf before truncating to int is redundant.
    // (removed because clang doesn't optimize away the roundps)
    int sum_int(float vals[]){
        int sum = 0;
        for (int i = 0; i < MAX; i++) {
            int thisSqrt = (int) sqrtf(vals[i]);
            sum += std::min(thisSqrt, 0x3F);
        }
        return sum;
    }

To manually vectorize with intrinsics, we can look at the asm output from -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized that it had problems.




Putting it together:

I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1 uop in SnB/IvB). Worse, both uops require port 1, so they compete with FP add / mul. cvttps2dq is a 1-uop instruction for port 1 with 3c latency, and then we can use integer min and add to clamp and accumulate, so port 5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).

As long as we aren't in danger of overflowing 32-bit accumulators, this seems to be the best choice. (Without having benchmarked anything or even tested it.)

I'm not using the sqrt(x) ~= approx_recip(approx_sqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division), and because the compounded error is larger.

Horizontal sum from this answer.

Full but untested version:

    #include <immintrin.h>
    #define MAX 4096

    // 2*sqrt(x) ~= 2*x*approx_rsqrt(x), with a Newton-Raphson iteration
    // dividing by 2 is faster in the integer domain, so we don't do it here
    __m256 approx_2sqrt_ps256(__m256 x) {
        // clang / gcc usually use -3.0 and -0.5. We could do the same by using fnmsub_ps
        // (add 3 = subtract -3), so we can share constants
        __m256 three = _mm256_set1_ps(3.0f);
        //__m256 half = _mm256_set1_ps(0.5f);   // we omit the *0.5 step

        __m256 nr  = _mm256_rsqrt_ps( x );      // initial approximation for Newton-Raphson
        // 1/sqrt(x) ~= nr * (3 - x*nr * nr) * 0.5 = nr*(1.5 - x*0.5*nr*nr)
        // sqrt(x) = x/sqrt(x) ~= (x*nr) * (3 - x*nr * nr) * 0.5
        // 2*sqrt(x) ~= (x*nr) * (3 - x*nr * nr)
        __m256 xnr = _mm256_mul_ps( x, nr );

        __m256 three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three );  // -(xnr*nr) + 3
        return _mm256_mul_ps( xnr, three_minus_muls );
    }

    // packed int32_t: correct results for inputs from 0 to well above 63*63
    __m256i isqrt256_ps(__m256 x) {
        __m256 offset = _mm256_set1_ps(0.5f);   // or subtract -0.5, to maybe share constants with compiler-generated Newton iterations.
        __m256 xoff = _mm256_add_ps(x, offset); // avoids 0*Inf = NaN, and rounding error before truncation

        __m256 approx_2sqrt_xoff = approx_2sqrt_ps256(xoff);
        __m256i i2sqrtx = _mm256_cvttps_epi32(approx_2sqrt_xoff);
        return _mm256_srli_epi32(i2sqrtx, 1);   // divide by 2 with truncation
        // alternatively, we could mask the low bit to zero and divide by two outside the loop,
        // but that has no advantage unless port0 turns out to be the bottleneck
    }

    __m256i isqrt256_ps_simple_exact(__m256 x) {
        __m256 sqrt_x = _mm256_sqrt_ps(x);
        __m256i isqrtx = _mm256_cvttps_epi32(sqrt_x);
        return isqrtx;
    }

    int hsum_epi32_avx(__m256i x256){
        __m128i xhi = _mm256_extracti128_si256(x256, 1);
        __m128i xlo = _mm256_castsi256_si128(x256);
        __m128i x  = _mm_add_epi32(xlo, xhi);
        __m128i hl = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
        hl  = _mm_add_epi32(hl, x);
        x   = _mm_shuffle_epi32(hl, _MM_SHUFFLE(2, 3, 0, 1));
        hl  = _mm_add_epi32(hl, x);
        return _mm_cvtsi128_si32(hl);
    }

    int sum_int_avx(float vals[]){
        __m256i sum = _mm256_setzero_si256();
        __m256i upperlimit = _mm256_set1_epi32(0x3F);

        for (int i = 0; i < MAX; i+=8) {
            __m256 v = _mm256_loadu_ps(vals+i);
            __m256i visqrt = isqrt256_ps(v);
            // assert visqrt == isqrt256_ps_simple_exact(v) or something
            visqrt = _mm256_min_epi32(visqrt, upperlimit);
            sum = _mm256_add_epi32(sum, visqrt);
        }
        return hsum_epi32_avx(sum);
    }

This compiles into nice code on godbolt, but I haven't tested it. clang makes slightly nicer code than gcc: clang uses broadcast loads from 4B locations for the set1 constants, rather than repeating them at compile time into 32B constants. gcc also has a strange movdqa to copy a register.

Anyway, the whole loop ends up being only 9 vector instructions, compared to 12 for the compiler-generated sum_int version. The compiler probably didn't notice the x*initial_guess(x) common subexpressions that occur in the Newton-Raphson iteration formula when you multiply the result by x, or something like that. It also does an extra mulps instead of a psrld because it does the *0.5 before converting to int. So that's where the extra two mulps instructions come from, plus the cmpps / blendvps.

    sum_int_avx(float*):
        vpxor           ymm3, ymm3, ymm3
        xor             eax, eax
        vbroadcastss    ymm0, dword ptr [rip + .LCPI4_0]  ; set1(0.5)
        vbroadcastss    ymm1, dword ptr [rip + .LCPI4_1]  ; set1(3.0)
        vpbroadcastd    ymm2, dword ptr [rip + .LCPI4_2]  ; set1(63)
    LBB4_1:                                               ; latencies
        vaddps          ymm4, ymm0, ymmword ptr [rdi + 4*rax] ; 3c
        vrsqrtps        ymm5, ymm4                        ; 7c
        vmulps          ymm4, ymm4, ymm5   ; x*nr         ; 5c
        vfnmadd213ps    ymm5, ymm4, ymm1                  ; 5c
        vmulps          ymm4, ymm4, ymm5                  ; 5c
        vcvttps2dq      ymm4, ymm4                        ; 3c
        vpsrld          ymm4, ymm4, 1                     ; 1c  this would be a mulps (but not on the critical path) if we did this in the FP domain
        vpminsd         ymm4, ymm4, ymm2                  ; 1c
        vpaddd          ymm3, ymm4, ymm3                  ; 1c
        ; ... (those 9 insns repeated: loop unrolling)
        add             rax, 16
        cmp             rax, 4096
        jl              .LBB4_1
        ; ... horizontal sum

IACA thinks that with no unrolling, Haswell can sustain a throughput of one iteration per 4.15 cycles, bottlenecking on ports 0 and 1. So you could potentially shave a cycle by accumulating sqrt(x)*2 (truncating to even numbers using _mm256_and_si256) and only dividing by two outside the loop.
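
A sketch of that variant (untested, reusing approx_2sqrt_ps256 and hsum_epi32_avx from the listing above; the loop changes are described in comments):

    // Keep 2*isqrt values in the accumulator: clear the low bit with a vpand (not limited to
    // port 0 the way vpsrld is) and halve the total once after the loop.
    __m256i i2sqrt256_even(__m256 x) {
        __m256 xoff = _mm256_add_ps(x, _mm256_set1_ps(0.5f));
        __m256i i2s = _mm256_cvttps_epi32(approx_2sqrt_ps256(xoff));  // ~2*sqrt(x+0.5), truncated
        return _mm256_and_si256(i2s, _mm256_set1_epi32(-2));          // clear bit 0: 2 * floor(sqrt(x))
    }
    // In the loop: clamp with _mm256_min_epi32(v, _mm256_set1_epi32(0x3F * 2));
    // after the loop: return hsum_epi32_avx(sum) >> 1;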

Also according to IACA, the latency of a single iteration is 38 cycles on Haswell. I only get 31c, so it's probably including L1 load-use latency or something. Anyway, this means that to saturate the execution units, operations from 8 iterations have to be in flight at once. That's 8 * ~14 unfused-domain uops = 112 unfused uops (or fewer with clang's unroll) that have to be in flight at once. Haswell's scheduler is actually only 60 entries, but the ROB is 192 entries. The early uops from early iterations will already have executed, so they only need to be tracked in the ROB, not also in the scheduler. Many of the slow uops are at the beginning of each iteration, though. Still, there's reason to hope that this will come close to saturating ports 0 and 1. Unless data is hot in L1 cache, cache / memory bandwidth will probably be the bottleneck.

Interleaving multiple dependency chains would also be better. When clang unrolls, it puts all 9 instructions for one iteration ahead of all 9 instructions for another iteration. It uses a surprisingly small number of registers, so it would be possible to have instructions for 2 or 4 iterations mixed together. This is the sort of thing compilers are supposed to be good at, but which is cumbersome for humans. :/
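
One way to nudge the compiler toward that, sketched here untested on top of isqrt256_ps and hsum_epi32_avx from the listing above:

    // Process two vectors per loop iteration with independent accumulators, so the two
    // dependency chains in one loop body can be scheduled interleaved.
    int sum_int_avx_2x(float vals[]){
        __m256i sum0 = _mm256_setzero_si256();
        __m256i sum1 = _mm256_setzero_si256();
        __m256i upperlimit = _mm256_set1_epi32(0x3F);

        for (int i = 0; i < MAX; i += 16) {
            __m256i a = _mm256_min_epi32(isqrt256_ps(_mm256_loadu_ps(vals + i)),     upperlimit);
            __m256i b = _mm256_min_epi32(isqrt256_ps(_mm256_loadu_ps(vals + i + 8)), upperlimit);
            sum0 = _mm256_add_epi32(sum0, a);
            sum1 = _mm256_add_epi32(sum1, b);
        }
        return hsum_epi32_avx(_mm256_add_epi32(sum0, sum1));
    }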

It would also be slightly more efficient if the compiler chose a one-register addressing mode, so the load could micro-fuse with the vaddps. gcc does this.

Mar 09 '16 at 13:40


