Efficiently multiply a large vector of complex numbers by a scalar array in C++

I'm trying to multiply, in place and as efficiently as possible, an array of complex numbers (the memory is aligned the same way std::complex would be, but we currently use our own ADT) by an array of scalar values that is the same size as the complex array.

The algorithm is already parallelized, i.e. the caller splits the work across threads. The calculation is performed on arrays of around 100 million elements, so it can take some time. CUDA is not an option for this product, much as I would like it to be. I do have access to Boost, so there is some potential for using BLAS/uBLAS.

I suspect, however, that SIMD could give much better results, but I'm not familiar enough with how to do this for complex numbers. The code I have now is below (remember this is split across threads matching the number of cores on the target machine). The target machine is also unknown, so a generic approach is probably best.

 void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
 {
     for (register int idx = start; idx < end; ++idx)
     {
         values[idx].real *= scalar[idx];
         values[idx].imag *= scalar[idx];
     }
 }

fcomplex is defined as follows:

 struct fcomplex { float real; float imag; }; 

I tried unrolling the loop manually, since my loop count will always be a multiple of 2, but the compiler was already doing that for me (it unrolled to 32). I tried hoisting the scalar into a const float reference, thinking I would save one access, and that turned out to be equivalent to what the compiler already did. I tried std::transform, which came close but was still worse. I also tried casting to std::complex and letting its overloaded scalar * complex operator do the multiplication, but that ended up giving the same results.

So, does anyone have any ideas? Your opinions on this would be much appreciated! The target platform is Windows and I am using Visual Studio 2008. The product also cannot contain GPL code! Thanks very much.

4 answers

You can do this quite easily with SSE, for example.

 #include <xmmintrin.h>  // SSE intrinsics

 void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
 {
     for (int idx = start; idx < end; idx += 2)
     {
         __m128 vc = _mm_load_ps((float *)&values[idx]);
         __m128 vk = _mm_set_ps(scalar[idx + 1], scalar[idx + 1], scalar[idx], scalar[idx]);
         vc = _mm_mul_ps(vc, vk);
         _mm_store_ps((float *)&values[idx], vc);
     }
 }

Note that values and scalar must both be 16-byte aligned.
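To satisfy that alignment requirement without C++11 (which VS2008 lacks), one option is _mm_malloc/_mm_free, which ship with the SSE intrinsics headers on both MSVC and GCC. A minimal sketch (the helper names alloc_fcomplex_aligned and is_aligned16 are my own, not from the question):

```cpp
#include <xmmintrin.h>  // _mm_malloc / _mm_free come with the SSE headers
#include <stdint.h>
#include <cassert>
#include <cstddef>

struct fcomplex { float real; float imag; };

// Allocate a 16-byte aligned buffer of n fcomplex values, so that
// _mm_load_ps / _mm_store_ps on it are safe. Caller frees with _mm_free.
inline fcomplex *alloc_fcomplex_aligned(int n)
{
    return static_cast<fcomplex *>(_mm_malloc(n * sizeof(fcomplex), 16));
}

// Check that a pointer is 16-byte aligned.
inline bool is_aligned16(const void *p)
{
    return reinterpret_cast<uintptr_t>(p) % 16 == 0;
}
```

The same call with a float element type covers the scalar array.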

Or you could just use the Intel ICC compiler and let it do the hard work for you.


UPDATE

Here is an improved version that unrolls the loop by 2 and uses a single load instruction to fetch four scalar values, which are then unpacked into two vectors:

 void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
 {
     for (int idx = start; idx < end; idx += 4)
     {
         __m128 vc0 = _mm_load_ps((float *)&values[idx]);
         __m128 vc1 = _mm_load_ps((float *)&values[idx + 2]);
         __m128 vk  = _mm_load_ps(&scalar[idx]);
         __m128 vk0 = _mm_shuffle_ps(vk, vk, 0x50);  // s0 s0 s1 s1
         __m128 vk1 = _mm_shuffle_ps(vk, vk, 0xfa);  // s2 s2 s3 s3
         vc0 = _mm_mul_ps(vc0, vk0);
         vc1 = _mm_mul_ps(vc1, vk1);
         _mm_store_ps((float *)&values[idx], vc0);
         _mm_store_ps((float *)&values[idx + 2], vc1);
     }
 }
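Note the vectorized loop assumes (end - start) is a multiple of 4. If that doesn't hold for all callers, a sketch of a variant with a plain scalar loop for the remainder (the name cmult_scalar_inplace_tail is mine, not from the question):

```cpp
#include <xmmintrin.h>
#include <cassert>

struct fcomplex { float real; float imag; };

// Same SSE body as above, plus a scalar tail for leftover elements.
// Still assumes values[start] and scalar[start] are 16-byte aligned.
void cmult_scalar_inplace_tail(fcomplex *values, const int start, const int end,
                               const float *scalar)
{
    int idx = start;
    for (; idx + 3 < end; idx += 4)  // vector part: 4 complex values per trip
    {
        __m128 vc0 = _mm_load_ps((float *)&values[idx]);
        __m128 vc1 = _mm_load_ps((float *)&values[idx + 2]);
        __m128 vk  = _mm_load_ps(&scalar[idx]);
        __m128 vk0 = _mm_shuffle_ps(vk, vk, 0x50);  // s0 s0 s1 s1
        __m128 vk1 = _mm_shuffle_ps(vk, vk, 0xfa);  // s2 s2 s3 s3
        _mm_store_ps((float *)&values[idx],     _mm_mul_ps(vc0, vk0));
        _mm_store_ps((float *)&values[idx + 2], _mm_mul_ps(vc1, vk1));
    }
    for (; idx < end; ++idx)  // scalar tail: at most 3 elements
    {
        values[idx].real *= scalar[idx];
        values[idx].imag *= scalar[idx];
    }
}
```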

Itโ€™s best to use an optimized BLAS that will use everything available on your target platform.


One problem I see is that inside the function it is hard for the compiler to know that the scalar pointer does not point into the middle of the complex array (scalar could theoretically point at the real or imaginary part of one of the complex values). This effectively forces the evaluation order.
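One way to lift that aliasing constraint is a no-alias annotation: MSVC has the non-standard __restrict extension (GCC spells it __restrict__). A sketch, assuming the caller really does guarantee the arrays never overlap:

```cpp
#include <cassert>

struct fcomplex { float real; float imag; };

// Portability shim: __restrict on MSVC, __restrict__ elsewhere.
#if defined(_MSC_VER)
#define NOALIAS __restrict
#else
#define NOALIAS __restrict__
#endif

// The NOALIAS qualifiers promise the compiler that values and scalar
// never alias, freeing it to reorder loads and vectorize the loop.
// Undefined behavior if the promise is broken, so the caller must ensure it.
void cmult_scalar_inplace_nr(fcomplex *NOALIAS values, const int start, const int end,
                             const float *NOALIAS scalar)
{
    for (int idx = start; idx < end; ++idx)
    {
        values[idx].real *= scalar[idx];
        values[idx].imag *= scalar[idx];
    }
}
```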

Another problem I see is that the calculation here is so simple that other factors will dominate the raw speed, so if you really care about performance the only solution, in my opinion, is to implement several variants and benchmark them at run time on the user's machine to find out which is fastest.

I would try different unroll sizes, and also play with the alignment of scalar and values (the memory access pattern can interact heavily with caching effects).

For the unwanted-serialization problem, one option is to look at the generated code for something like:

 float r0 = values[i].real,   i0 = values[i].imag,   s0 = scalar[i];
 float r1 = values[i+1].real, i1 = values[i+1].imag, s1 = scalar[i+1];
 float r2 = values[i+2].real, i2 = values[i+2].imag, s2 = scalar[i+2];
 values[i].real = r0*s0;     values[i].imag = i0*s0;
 values[i+1].real = r1*s1;   values[i+1].imag = i1*s1;
 values[i+2].real = r2*s2;   values[i+2].imag = i2*s2;

because the optimizer theoretically has a little more freedom.


Do you have access to Intel's Integrated Performance Primitives (IPP)? They have a number of functions that handle cases like this with fairly decent performance. You might have some success with your specific problem, but I wouldn't be surprised if your compiler already does a decent job of optimizing the code.

