tl;dr: see the end for code that compiles and should work.
To just solve the 0.0 problem, you could also special-case inputs of 0.0, using an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x). Clang does this when autovectorizing. (But it uses blendps with the zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)
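A minimal sketch of that compare-and-mask idea with SSE intrinsics (this shows the technique, not clang's exact output):

```c++
#include <immintrin.h>

// sqrt(x) ~= x * rsqrt(x), with the x == 0.0 lanes masked off so the
// 0 * +Inf = NaN never reaches the result.
__m128 approx_sqrt_zerosafe(__m128 x) {
    __m128 zeromask = _mm_cmpeq_ps(x, _mm_setzero_ps()); // all-ones where x == 0.0
    __m128 approx   = _mm_mul_ps(x, _mm_rsqrt_ps(x));    // NaN in the x == 0.0 lanes
    return _mm_andnot_ps(zeromask, approx);              // zero out the NaN lanes
}
```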
You could also use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa. rsqrt(0) = +Inf, and recip(+Inf) = 0. However, chaining two approximations compounds the relative error, which is a problem.
What you are missing:
Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25 * rsqrt(25) is 4.999999 or something (instead of 5.00001), you'll add 4 instead of 5.
Even with a Newton-Raphson iteration, rsqrtps isn't perfectly accurate the way sqrtps is, so it can still give 5.0 - 1ulp. (1ulp = one unit in the last place = the lowest bit of the mantissa.)
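A two-line demonstration of why that matters (nothing here is specific to rsqrt; it's just what truncation does to a result that is 1ulp low):

```c++
#include <cmath>
#include <cstdio>

int main() {
    float almost5 = std::nextafterf(5.0f, 0.0f); // 5.0f - 1ulp = 4.9999995...
    printf("%d\n", (int)almost5);                // prints 4, not 5
}
```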
Also:
- the Newton-Raphson formula, explained
- Newton-Raphson SSE implementation efficiency (latency / throughput). Note that we care more about throughput than latency, since we're using it in a loop that doesn't do much else. sqrt isn't part of the loop-carried dependency chain, so different iterations can have their sqrt calculations in flight at once.
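For reference, the refinement step those links describe, written in the form this answer uses later (a scalar sketch; y is the rsqrtps output for x):

```c++
// One Newton-Raphson iteration for y ~= 1/sqrt(x):
//     y' = y * (1.5 - 0.5 * x * y*y)
// Multiply both sides by x and it becomes a sqrt refinement:
//     sqrt(x) ~= (x*y) * (1.5 - 0.5 * (x*y) * y)
static inline float nr_sqrt_step(float x, float y) {
    float xy = x * y;                    // first approximation of sqrt(x)
    return xy * (1.5f - 0.5f * xy * y);  // refined approximation of sqrt(x)
}
```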
It might be possible to kill two birds with one stone by adding a small constant before doing (x+offset)*approx_rsqrt(x+offset), and then truncating to integer: large enough to overcome the maximum relative error of 1.5 * 2^-12, but small enough not to bump sqrt_approx(63*63-1+offset) up to 63 (the most sensitive case).
```
63*1.5*2^(-12)          ==  0.023071...
approx_sqrt(63*63-1)    == 62.99206... +/- 0.023068..
```
Actually, we're out of luck without a Newton iteration, even without adding anything: approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where sqrt(n*n-1) plus its maximum relative error stays below sqrt(n*n). GNU Calc:
```
define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
; f(36)
35.98610843089316319413
~0.01317850650545403926 ~35.99928693739861723339
; f(37)
36.9864840178138587015
~0.01354485498699237990 ~37.00002887280085108140
```
Does your source data have any properties that mean you don't have to worry about values just below a large perfect square? E.g. are they always perfect squares?
You could check all possible input values, since the important domain is very small (integer FP values from 0 to 63*63), to see whether the error is in practice small enough on Intel Haswell. But that would be a brittle optimization that could break your code on AMD CPUs, or even on future Intel CPUs. Unfortunately, coding to just the ISA spec's guarantee (a relative error of up to 1.5 * 2^-12) requires extra instructions. I don't see any tricks that would avoid a NR iteration.
If your upper limit were smaller (e.g. 20), you could just do isqrt = static_cast<int>((x+0.5) * approx_rsqrt(x+0.5)). You'd get 20 for 20*20, but always 19 for 20*20-1.
```
; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
; test_approx_sqrt(20, 0.5)
~20.01249609618950056874 ~19.98749609130668473087   # (x+0.5)/sqrt(x+0.5)
 ~0.00732879495710064718  ~0.00731963968187500662   # max relative error of each
```
Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should hold for relative FP errors.
Anyway, I think you need one Newton-Raphson iteration.
Your best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem, and pushes the +/- error range entirely to one side of the whole-number cut point (for numbers in the range we care about).

FP min/max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result.
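A scalar sketch of that whole recipe, assuming SSE for the rsqrt approximation (illustrative and untested; the clamp to 63 matches the loop below):

```c++
#include <immintrin.h>

// rsqrtss wrapped as a scalar helper; relative error <= 1.5*2^-12 per the ISA spec.
static inline float approx_rsqrt(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

static inline int isqrt_clamped(float x) {
    float s  = x + 0.5f;                     // sidesteps 0*Inf = NaN, biases error upward
    float y  = approx_rsqrt(s);
    float sy = s * y;                        // ~= sqrt(s)
    float r  = sy * (1.5f - 0.5f * sy * y);  // one Newton-Raphson step
    int   i  = (int)r;                       // truncate toward zero
    return i < 0x3F ? i : 0x3F;              // clamp to 63
}
```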
Compiler output: a decent starting point
I get pretty good auto-vectorization results for sum_int from clang 3.7.1, with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).
sum_fp doesn't auto-vectorize, even with the full -ffast-math.
However, clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, which can give the wrong integer. IDK if that's why gcc doesn't auto-vectorize, since it could have used sqrtps for a big speedup without changing the results. (At least as long as all the floats are between 0 and INT_MAX^2; otherwise converting back to integer gives the "integer indefinite" result of INT_MIN: sign bit set, all other bits cleared.) This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something like that.
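If you haven't run into the "integer indefinite" behavior before, it's easy to demonstrate (a standalone snippet, unrelated to the loop here):

```c++
#include <immintrin.h>
#include <cstdio>

int main() {
    // Out-of-range FP->int conversion on x86 produces 0x80000000 (INT_MIN),
    // not a wrapped or saturated value.
    int r = _mm_cvttss_si32(_mm_set_ss(3e9f));  // 3e9 > INT_MAX
    printf("%d (0x%08x)\n", r, (unsigned)r);    // -2147483648 (0x80000000)
}
```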
See the asm output on godbolt from:
```c++
// autovectorizes with clang, but has rounding problems.
// Note the use of sqrtf, and that floorf before truncating to int is redundant.
// (removed because clang doesn't optimize away the roundps)
int sum_int(float vals[]){
  int sum = 0;
  for (int i = 0; i < MAX; i++) {
    int thisSqrt = (int) sqrtf(vals[i]);
    sum += std::min(thisSqrt, 0x3F);
  }
  return sum;
}
```
To manually vectorize with intrinsics, we can look at the asm output compiled with -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized it had problems.
Putting it together:
I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1 uop on SnB/IvB). Worse, both uops require port 1, so they compete with FP add/mul. cvttps2dq is a 1-uop instruction for port 1, with 3c latency, and then the integer min and add can do the clamping and accumulating, so port 5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).
As long as we aren't in danger of overflowing 32-bit accumulators, this seems to be the best choice. (Without having benchmarked anything, or even tested it.)
I'm not using the sqrt(x) ~= approx_recip(approx_rsqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division), and because the compounded error is higher.
The horizontal sum is from this answer.
Complete but untested version:
```c++
#include <immintrin.h>
```
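Continuing from that #include, here is a sketch of what the implementation can look like, reconstructed to match the asm below (helper names approx_2sqrt_ps256 and hsum_epi32_avx are illustrative; AVX2 + FMA assumed; MAX assumed to be a multiple of 8):

```c++
#define MAX 4096

// 2*sqrt(x) ~= 2 * x * approx_rsqrt(x), with one Newton-Raphson iteration.
// Computing 2*sqrt(x) folds the *0.5 of the NR step into the result; we
// divide by 2 later, in the integer domain, with a cheap shift.
static inline __m256 approx_2sqrt_ps256(__m256 x) {
    __m256 three = _mm256_set1_ps(3.0f);
    __m256 nr  = _mm256_rsqrt_ps(x);       // y ~= 1/sqrt(x), rel. err <= 1.5*2^-12
    __m256 xnr = _mm256_mul_ps(x, nr);     // x*y ~= sqrt(x)
    // 2*sqrt(x) ~= (x*y) * (3 - (x*y)*y): the NR step with the 0.5 factored out
    __m256 three_minus = _mm256_fnmadd_ps(xnr, nr, three);  // 3 - xnr*nr
    return _mm256_mul_ps(xnr, three_minus);
}

// horizontal sum of 8 int32 elements (see the linked answer for variations)
static inline int hsum_epi32_avx(__m256i x) {
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(x),
                              _mm256_extracti128_si256(x, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}

int sum_int_avx(const float vals[]) {
    __m256i sum = _mm256_setzero_si256();
    __m256i upperlimit = _mm256_set1_epi32(0x3F);
    __m256  offset = _mm256_set1_ps(0.5f);

    for (int i = 0; i < MAX; i += 8) {
        __m256 v = _mm256_loadu_ps(vals + i);
        __m256 s2 = approx_2sqrt_ps256(_mm256_add_ps(v, offset)); // +0.5: no 0*Inf NaN
        __m256i isqrt = _mm256_srli_epi32(_mm256_cvttps_epi32(s2), 1); // truncate, /2
        sum = _mm256_add_epi32(sum, _mm256_min_epi32(isqrt, upperlimit)); // clamp to 63
    }
    return hsum_epi32_avx(sum);
}
```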
It compiles on godbolt into good code, but I haven't tested it. clang makes slightly nicer code than gcc: clang uses broadcast loads from 4B memory locations for the set1 constants, instead of repeating them at compile time into 32B constants. gcc also has a weird movdqa to copy a register.
Either way, the whole loop ends up being only 9 vector instructions, compared to 12 for the compiler-generated sum_int version. The compiler probably didn't notice the common subexpression x*initial_guess(x) that appears in the Newton-Raphson iteration formula when you multiply the result by x, or something like that. It also does an extra mulps instead of a psrld, because it does the *0.5 before converting to int. So that's where the extra two mulps instructions come from, plus there's the cmpps/blendvps.
```asm
sum_int_avx(float*):
    vpxor           ymm3, ymm3, ymm3
    xor             eax, eax
    vbroadcastss    ymm0, dword ptr [rip + .LCPI4_0]  ; set1(0.5)
    vbroadcastss    ymm1, dword ptr [rip + .LCPI4_1]  ; set1(3.0)
    vpbroadcastd    ymm2, dword ptr [rip + .LCPI4_2]  ; set1(63)
.LBB4_1:                                              ; latencies
    vaddps          ymm4, ymm0, ymmword ptr [rdi + 4*rax] ; 3c
    vrsqrtps        ymm5, ymm4                        ; 7c
    vmulps          ymm4, ymm4, ymm5                  ; x*nr  ; 5c
    vfnmadd213ps    ymm5, ymm4, ymm1                  ; 5c
    vmulps          ymm4, ymm4, ymm5                  ; 5c
    vcvttps2dq      ymm4, ymm4                        ; 3c
    vpsrld          ymm4, ymm4, 1                     ; 1c  this would be a mulps (but not on the critical path) if we did it in the FP domain
    vpminsd         ymm4, ymm4, ymm2                  ; 1c
    vpaddd          ymm3, ymm4, ymm3                  ; 1c
    ; ... (those 9 insns repeated: loop unrolling)
    add             rax, 16
    cmp             rax, 4096
    jl              .LBB4_1
    ; ... horizontal sum
```
IACA thinks that with no unrolling, Haswell can sustain a throughput of one iteration per 4.15 cycles, bottlenecked on ports 0 and 1. So potentially you could shave a cycle by accumulating sqrt(x)*2 (rounding down to an even number with _mm256_and_si256) and dividing by two only once, outside the loop.
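A sketch of that variant, reusing the helpers from the sketch above (again untested; the idea is that vpand can run on more ports than vpsrld, which needs port 0 on Haswell, easing the port 0/1 bottleneck):

```c++
int sum_int_avx_2x(const float vals[]) {
    __m256i sum2 = _mm256_setzero_si256();
    __m256i upperlimit2 = _mm256_set1_epi32(2 * 0x3F);
    __m256i evenmask = _mm256_set1_epi32(~1);
    __m256  offset = _mm256_set1_ps(0.5f);

    for (int i = 0; i < MAX; i += 8) {
        __m256 v = _mm256_loadu_ps(vals + i);
        __m256i v2 = _mm256_cvttps_epi32(approx_2sqrt_ps256(_mm256_add_ps(v, offset)));
        v2 = _mm256_and_si256(v2, evenmask);  // floor(2*s) & ~1 == 2*floor(s)
        sum2 = _mm256_add_epi32(sum2, _mm256_min_epi32(v2, upperlimit2));
    }
    return hsum_epi32_avx(sum2) >> 1;  // the deferred divide-by-2
}
```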
Also according to IACA, the latency of one iteration is 38 cycles on Haswell. I only count 31c, so it's probably including L1 load-use latency or something. Either way, this means that to saturate the execution units, operations from 8 iterations have to be in flight at once. That's 8 * ~14 unfused-domain uops = 112 unfused-domain uops (or fewer with clang's unroll) that have to be in flight at once. Haswell's scheduler is only 60 entries, but the ROB is 192 entries. The earliest uops from the earliest iterations will already have executed, so they only need to be tracked in the ROB, not in the scheduler. Many of the slow uops are at the beginning of each iteration, though. Still, there's reason to hope that this will come close to saturating ports 0 and 1. Unless the data is hot in L1 cache, cache/memory bandwidth will probably be the bottleneck.
Interleaving multiple dependency chains would also be better. When clang unrolls, it puts all 9 instructions of one iteration before all 9 instructions of the next. It uses a surprisingly small number of registers, so it would be possible to have the instructions for 2 or 4 iterations interleaved. This is the sort of thing compilers are supposed to be good at, but which is cumbersome for humans. :/
It would also be slightly more efficient if the compiler used a one-register addressing mode, so the load could stay micro-fused with the vaddps. gcc does this.