I'm still trying to come up with a case where it will make a difference. My feeling is that if latency is the problem, there are cases where x+x will be better, but if latency is not a problem and only throughput matters, then it could be worse. But first, let's discuss some hardware.

Let me stick with Intel x86 processors, since that's what I know best. Consider the following hardware generations: Core2/Nehalem, SandyBridge/IvyBridge, and Haswell/Broadwell.
Latency (in cycles) and throughput (in results per clock) for SIMD floating-point arithmetic operations:
- The latency for addition is 3.
- The latency for multiplication is 5, except on Broadwell.
- On Broadwell, the latency for multiplication is 3.
- The throughput for addition is 1.
- The throughput for multiplication is 1, except on Haswell and Broadwell.
- On Haswell and Broadwell, the throughput for multiplication is 2.
- On Haswell and Broadwell, the throughput for FMA is 2.
- The latency for FMA is 5.
- Each FMA counts as 2 operations (a multiply and an add), so a throughput of 2 FMAs per clock is 4 floating-point operations per clock.
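To see what these numbers mean in practice, here is a minimal sketch in plain C with SSE intrinsics (the function names are mine). The first loop is latency bound; the second does roughly the same number of additions in three independent chains, which is enough to hide the 3-cycle latency and reach one add per clock:

#include <immintrin.h>

// Latency bound: a single dependency chain, so each add must
// wait the full 3 cycles for the previous result.
__m128 sum_chain(__m128 x, __m128 d, int n) {
    for (int i = 0; i < n; i++)
        x = _mm_add_ps(x, d);
    return x;
}

// Throughput bound: three independent chains cover the 3-cycle
// latency, so one add can issue every clock.
__m128 sum_chain3(__m128 x, __m128 d, int n) {
    __m128 a = x, b = x, c = x;
    for (int i = 0; i < n; i += 3) {
        a = _mm_add_ps(a, d);
        b = _mm_add_ps(b, d);
        c = _mm_add_ps(c, d);
    }
    return _mm_add_ps(a, _mm_add_ps(b, c));
}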
Now consider a real calculation that contains a multiplication by 2: the inner loop of the Mandelbrot set iteration, which updates each point as
xtemp = x;
x = x*x - y*y + x0;
y = 2*xtemp*y + y0;
The Mandelbrot calculation is easy to vectorize with SIMD (SSE or AVX), computing several points per iteration (4 with SSE, 8 with AVX, in single precision). To use SIMD here you would normally write it with intrinsics. The update of y can also be written without the multiplication by 2 as

y = xtemp*y + xtemp*y + y0;
But what about FMA? Should y be computed as

y = fma(2*xtemp, y, y0);

or as

y = xtemp*y + fma(xtemp, y, y0);
I tried these variants. In my tests it made little difference whether y was computed with the multiplication by 2 or as y = xtemp*y + xtemp*y + y0. What did matter was FMA, which Intel has had since Haswell. I measured about a 15% speedup with FMA, with both 4-wide SSE and 8-wide AVX.
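For concreteness, here is a rough sketch of one vectorized Mandelbrot iteration for 8 single-precision points with AVX2/FMA intrinsics; the function name and the choice of x+x for 2*x are mine, not from any particular implementation:

#include <immintrin.h>

// One Mandelbrot iteration for 8 points; x and y are updated
// in place, x0 and y0 hold the per-point constants.
static inline void mandel_iter(__m256 *x, __m256 *y, __m256 x0, __m256 y0) {
    __m256 xtemp = *x;
    // x = x*x - y*y + x0, with x*x + (x0 - y*y) fused into an FMA
    *x = _mm256_fmadd_ps(xtemp, xtemp,
                         _mm256_sub_ps(x0, _mm256_mul_ps(*y, *y)));
    // y = 2*xtemp*y + y0, written as fma(xtemp + xtemp, y, y0)
    *y = _mm256_fmadd_ps(_mm256_add_ps(xtemp, xtemp), *y, y0);
}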
Next, an example where latency is not the issue at all: an operation that streams through an array, so that memory bandwidth is what matters:
for(int i=0; i<n; i++) y[i] = 2*x[i];
This loop is memory bandwidth bound, not compute bound. On Haswell and Broadwell the multiplication throughput is twice the addition throughput, which on paper favors 2*x over x+x, but since Haswell/Broadwell can only store 32 bytes per clock cycle, the stores cap the loop long before the arithmetic does.
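A sketch of the AVX version makes the balance explicit: per 8 floats there is one 32-byte load, one arithmetic instruction, and one 32-byte store, so the one-store-per-clock limit caps the speed either way (assuming, for brevity, that n is a multiple of 8 and the arrays don't overlap):

#include <immintrin.h>

void scale_by_2(float *y, const float *x, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(&x[i]);
        // x+x here; 2*x would be one _mm256_mul_ps instead, and
        // neither matters next to the one-store-per-clock limit.
        _mm256_storeu_ps(&y[i], _mm256_add_ps(v, v));
    }
}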
So in that loop the choice between 2*x and x+x is irrelevant. Now for a loop with a real dependency chain:
for(int i=0; i<n; i++) prod = prod * (2*x[i]);
compared with:
for(int i=0; i<n; i++) prod = prod * (x[i]+x[i]);
In both versions the dependency chain runs through prod: each multiplication by prod has to wait for the one before it, so the loop is bound by the multiply latency, and the 2*x[i] or x[i]+x[i] is computed off the critical path while the chain waits. To make the loop throughput bound you would break the chain into several independent partial products, and only then could the choice matter: x[i]+x[i] goes to the addition port, while 2*x[i] competes with the chain itself for the multiplication port. On hardware with FMA, i.e. Haswell and Broadwell, which can do two multiplications per clock, even that difference goes away.
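Here is a sketch of what breaking the chain looks like, with four partial products; the count is illustrative, since it takes roughly latency times throughput independent chains to saturate the multiplier, and n is assumed to be a multiple of 4:

float prod_of_2x(const float *x, int n) {
    float p0 = 1.0f, p1 = 1.0f, p2 = 1.0f, p3 = 1.0f;
    for (int i = 0; i < n; i += 4) {
        // Independent partial products hide the multiply latency;
        // the x+x adds can issue on the addition port in parallel.
        p0 *= x[i]   + x[i];
        p1 *= x[i+1] + x[i+1];
        p2 *= x[i+2] + x[i+2];
        p3 *= x[i+3] + x[i+3];
    }
    return (p0 * p1) * (p2 * p3);
}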
In any case, a smart compiler (or programmer) would hoist the factor of 2 out of this particular loop entirely:
for(int i=0; i<n; i++) prod *= x[i];
prod *= pow(2,n);
and then the whole question of x+x versus 2*x disappears from the loop.