I'm currently trying to multiply an array of complex numbers in place (the memory is aligned the same way std::complex would be, but we currently use our own ADT) by an array of scalar values of the same size, as efficiently as possible.
The algorithm is already parallelized, i.e. the caller splits the work across threads. This calculation is performed on arrays on the order of 100 million elements, so it can take some time. CUDA is not an option for this product, although I wish it were. I do have access to Boost, so there is some potential for using BLAS/uBLAS.
I think, however, that SIMD could give much better results, but I'm not familiar enough with how to do this with complex numbers. The code I have now is as follows (remember that it is invoked from multiple threads, matching the number of cores on the target machine). The target machine is also unknown, so a general approach is probably best.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (register int idx = start; idx < end; ++idx)
    {
        values[idx].real *= scalar[idx];
        values[idx].imag *= scalar[idx];
    }
}
fcomplex is defined as follows:
struct fcomplex { float real; float imag; };
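The SIMD idea mentioned above might be sketched with SSE intrinsics roughly like this (my own sketch, not the existing code; `cmult_scalar_inplace_sse` is a hypothetical name, and it assumes `(end - start)` is even):

```cpp
#include <xmmintrin.h>

struct fcomplex { float real; float imag; };

// Hypothetical SSE sketch: with fcomplex laid out as {real, imag}, two
// complex values fill one 128-bit register as [r0, i0, r1, i1], so the
// two matching scalars are duplicated into [s0, s0, s1, s1] and a single
// packed multiply handles both elements. Assumes (end - start) is even.
void cmult_scalar_inplace_sse(fcomplex *values, const int start,
                              const int end, const float *scalar)
{
    for (int idx = start; idx < end; idx += 2)
    {
        float *p = reinterpret_cast<float *>(&values[idx]);
        __m128 v = _mm_loadu_ps(p);                      // [r0, i0, r1, i1]
        __m128 s = _mm_set_ps(scalar[idx + 1], scalar[idx + 1],
                              scalar[idx], scalar[idx]); // [s0, s0, s1, s1]
        _mm_storeu_ps(p, _mm_mul_ps(v, s));
    }
}
```

If the arrays are guaranteed 16-byte aligned, `_mm_load_ps`/`_mm_store_ps` can replace the unaligned variants, and an odd element count would need a scalar remainder loop.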
I tried to manually unroll the loop, since my loop count will always be a power of 2, but the compiler is already doing this for me (I unrolled up to 32). I tried hoisting the scalar into a const float, thinking I would save one access, and that turned out to be equal to what the compiler was already doing. I also tried STL's transform, which came close but was still worse. And I tried casting to std::complex and letting it use its overloaded scalar * complex operator for the multiplication, but that ended up giving the same results.
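For reference, the std::transform attempt described above might look something like this (a sketch; `mul` and `cmult_scalar_transform` are illustrative names, a plain function is used because VS2008 is C++03 and has no lambdas, and it assumes the data is accessed as std::complex&lt;float&gt; directly):

```cpp
#include <algorithm>
#include <complex>

// Helper for the std::complex overload of scalar * complex multiplication.
static std::complex<float> mul(std::complex<float> c, float s)
{
    return c * s;
}

// Sketch of the std::transform variant: multiply each complex value by
// its matching scalar, writing the result back in place.
void cmult_scalar_transform(std::complex<float> *values, const int start,
                            const int end, const float *scalar)
{
    std::transform(values + start, values + end,
                   scalar + start, values + start, mul);
}
```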
So, does anyone have any ideas? Your input is much appreciated! The target platform is Windows and I'm using Visual Studio 2008. The product also cannot contain GPL code! Thanks in advance.