Is there a faster way to multiply by 2 with SIMD (without using multiplication)?

An old floating-point trick was to never multiply by 2, but instead to add the operand to itself: 2 * a = a + a. Is it possible to use this old trick with instruction sets like SSE / SSE2 / SSSE3 / NEON / ... and the like? My operand will be a vector (say 4 floats that I want to multiply by 2). What about multiplying by 3, 4, ...?

+4
2 answers

I'm still trying to find a case where this would make a difference. My feeling is that if latency is an issue there are cases where x+x will be better, but if latency is not an issue and only throughput matters, then it could be worse. But first, let's discuss some hardware.

Let me stick with Intel x86 processors, since that's what I know best. Consider the following hardware generations: Core2/Nehalem, SandyBridge/IvyBridge, and Haswell/Broadwell.

Latency and throughput for SIMD floating-point arithmetic operations:

  • The latency for addition is 3 cycles.
  • Except on Broadwell, the latency for multiplication is 5 cycles.
  • On Broadwell, multiplication has a latency of 3 cycles.
  • The throughput for addition is 1 per clock cycle.
  • On Haswell and Broadwell, the throughput for addition is still 1 per clock cycle.
  • On Haswell and Broadwell, the throughput for multiplication is 2 per clock cycle.
  • The throughput for FMA is 2 per clock cycle.
  • The latency for FMA is 5 cycles.
  • An FMA counts as 2 floating-point operations, so 2 FMAs per cycle give 4 flops per cycle.

As an example where this can make a difference, consider the Mandelbrot set, whose inner loop multiplies by 2. The main iteration computes:

float xtemp = x;
x = x*x - y*y + x0;
y = 2*xtemp*y + y0;

If you vectorize this with SIMD (SSE or AVX), you operate on several pixels at once (4 with SSE, 8 with AVX, in single precision). If you vectorize with SIMD intrinsics, then to compute y you could do

y = xtemp*y + xtemp*y + y0

But what about FMA? You could write either

y = fma(2*xtemp, y, y0)

or

y = xtemp*y + fma(xtemp, y, y0);

I tried the different variants. In my tests, y = xtemp*y + xtemp*y + y0, which replaces the multiplication by 2 with an addition, was the fastest, so here the addition does beat the multiply. Contrary to what I expected, FMA was not better on Haswell for this step. Overall I saw about a 15% improvement from FMA, drawing 4 pixels at a time with SSE or 8 with AVX.

Edit: here is another case to consider. If all you ever do is multiply an array by 2, the choice does not matter at all:

for(int i=0; i<n; i++) y[i] = 2*x[i];

This loop is memory-bandwidth bound, so there is no measurable difference between 2*x and x+x. On Haswell and Broadwell, multiplication has twice the throughput of the addition x+x, but Haswell/Broadwell can also read and write 32-byte vectors every clock, so memory, not arithmetic, is the bottleneck here.

However, x+x versus 2*x could in principle matter in a loop with a dependency chain, such as this reduction:

for(int i=0; i<n; i++) prod = prod * (2*x[i]);

compared to:

for(int i=0; i<n; i++) prod = prod * (x[i]+x[i]);

In both cases the loop is limited by the dependency chain on prod: each multiplication into prod has to wait for the previous one to finish, while the 2*x[i] or x[i]+x[i] is computed off the critical path. So the two forms perform about the same here, and FMA does not change that picture.

But the best solution in this case is to pull the factor of 2 out of the loop entirely:

for(int i=0; i<n; i++) prod *= x[i];
prod *= pow(2,n);

which avoids the question of x+x versus 2*x altogether.

+5

It depends on the microarchitecture, but on essentially every processor with SIMD, for a vector x, 2.0 * x and x + x have the same latency and throughput. They also produce bit-identical results for every input x, so write whichever form is clearer.

The same applies to other small constant multiples. A compiler will turn 2 * x into x + x (and 2 * y into y + y) by itself when that is profitable, so there is no need to contort your source. Be careful about doing it by hand, though: the pattern a * b + c can compile to a single FMA instruction, so an expression like 2 * x + y is best left as written; if you rewrite 2 * x as x + x yourself, the compiler is forced to compute (x + x) + y with two separate additions.


+5
