Complex Mul and Div with sse Instructions

Does complex multiplication and division profitable with SSE instructions? I know that addition and subtraction improve when using SSE. Can someone tell me how can I use SSE to do complex multiplication to improve performance?

+7
x86 sse simd
source share
3 answers

Complex well multiplication is defined as:

((c1a * c2a) - (c1b * c2b)) + ((c1b * c2a) + (c1a * c2b))i 

So, your 2 components in the complex number will be

 ((c1a * c2a) - (c1b * c2b)) and ((c1b * c2a) + (c1a * c2b))i 

Suppose you use 8 floats to represent 4 complex numbers, defined as follows:

 c1a, c1b, c2a, c2b c3a, c3b, c4a, c4b 

And you want to do (c1 * c3) and (c2 * c4) at the same time, your SSE code will look "something" like this:

(Note: I used MSVC under windows, but the principle will be the same).

 __declspec( align( 16 ) ) float c1c2[] = { 1.0f, 2.0f, 3.0f, 4.0f }; __declspec( align( 16 ) ) float c3c4[] = { 4.0f, 3.0f, 2.0f, 1.0f }; __declspec( align( 16 ) ) float mulfactors[] = { -1.0f, 1.0f, -1.0f, 1.0f }; __declspec( align( 16 ) ) float res[] = { 0.0f, 0.0f, 0.0f, 0.0f }; __asm { movaps xmm0, xmmword ptr [c1c2] // Load c1 and c2 into xmm0. movaps xmm1, xmmword ptr [c3c4] // Load c3 and c4 into xmm1. movaps xmm4, xmmword ptr [mulfactors] // load multiplication factors into xmm4 movaps xmm2, xmm1 movaps xmm3, xmm0 shufps xmm2, xmm1, 0xA0 // Change order to c3a c3a c4a c4a and store in xmm2 shufps xmm1, xmm1, 0xF5 // Change order to c3b c3b c4b c4b and store in xmm1 shufps xmm3, xmm0, 0xB1 // change order to c1b c1a c2b c2a abd store in xmm3 mulps xmm0, xmm2 mulps xmm3, xmm1 mulps xmm3, xmm4 // Flip the signs of the 'a so the add works correctly. addps xmm0, xmm3 // Add together movaps xmmword ptr [res], xmm0 // Store back out }; float res1a = (c1c2[0] * c3c4[0]) - (c1c2[1] * c3c4[1]); float res1b = (c1c2[1] * c3c4[0]) + (c1c2[0] * c3c4[1]); float res2a = (c1c2[2] * c3c4[2]) - (c1c2[3] * c3c4[3]); float res2b = (c1c2[3] * c3c4[2]) + (c1c2[2] * c3c4[3]); if ( res1a != res[0] || res1b != res[1] || res2a != res[2] || res2b != res[3] ) { _exit( 1 ); } 

What I did above, I simplified the math a bit. Assuming the following:

 c1a c1b c2a c2b c3a c3b c4a c4b 

Rebuilding, I end with the following vectors

 0 => c1a c1b c2a c2b 1 => c3b c3b c4b c4b 2 => c3a c3a c4a c4a 3 => c1b c1a c2b c2a 

Then I multiply 0 and 2 together to get:

 0 => c1a * c3a, c1b * c3a, c2a * c4a, c2b * c4a 

Then I multiply 3 and 1 together to get:

 3 => c1b * c3b, c1a * c3b, c2b * c4b, c2a * c4b 

Finally, I flip the signs of the pair of floats to 3

 3 => -(c1b * c3b), c1a * c3b, -(c2b * c4b), c2a * c4b 

So I can add them together and get

 (c1a * c3a) - (c1b * c3b), (c1b * c3a ) + (c1a * c3b), (c2a * c4a) - (c2b * c4b), (c2b * c4a) + (c2a * c4b) 

This is what we were after :)

+9
source share

For completeness only, the Intelยฎ 64 and IA-32 Architecture Optimization Reference Guide, which can be downloaded here, contains an assembly for complex multiplication (Example 6-9) and complex division (Example 6-10).

Here, for example, the multiplication code:

 // Multiplication of (ak + i bk ) * (ck + i dk ) // a + ib can be stored as a data structure movsldup xmm0, src1; load real parts into the destination, a1, a1, a0, a0 movaps xmm1, src2; load the 2nd pair of complex values, ie d1, c1, d0, c0 mulps xmm0, xmm1; temporary results, a1d1, a1c1, a0d0, a0c0 shufps xmm1, xmm1, b1; reorder the real and imaginary parts, c1, d1, c0, d0 movshdup xmm2, src1; load imaginary parts into the destination, b1, b1, b0, b0 mulps xmm2, xmm1; temporary results, b1c1, b1d1, b0c0, b0d0 addsubps xmm0, xmm2; b1c1+a1d1, a1c1 -b1d1, b0c0+a0d0, ; a0c0-b0d0 

The assembly maps directly to gccs X86 intrinsics (just a predicate of each command with __builtin_ia32_ ).

+8
source share

The algorithm in the Intel optimization task does not handle overflows and NaN input correctly.

A single NaN in the real or imaginary part of the number will be incorrectly extended to the other part.

Since several operations with infinity (for example, infinity * 0) end with NaN , overflow can cause NaN appear in your well-stored data.

If overflow and NaN are rare, an easy way to avoid this is to simply check the result of the NaN in the result and recompile it using an IEEE compatible compiler implementation:

 float complex a[2], b[2]; __m128 res = simd_fast_multiply(a, b); /* store unconditionally, can be executed in parallel with the check * making it almost free if there is no NaN in data */ _mm_store_ps(dest, res); /* check for NaN */ __m128 n = _mm_cmpneq_ps(res, res); int have_nan = _mm_movemask_ps(n); if (have_nan != 0) { /* do it again unvectorized */ dest[0] = a[0] * b[0]; dest[1] = a[1] * b[1]; } 
+4
source share

All Articles