Even without looking at the assembly, I can tell right away that the bottleneck is the 4-element gather from memory and the packing done by _mm_set_epi32. Internally, _mm_set_epi32 will, in your case, probably be implemented as a series of unpacklo/hi instructions.
Most of the "work" in this loop comes from packing those 4 memory accesses. In the absence of SSE4.1, I would go as far as to say that the loop could be faster non-vectorized, but unrolled.
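To make that concrete, here is a minimal sketch of the scalar-but-unrolled alternative. It assumes the same expTable lookup, a plain array of log-sum indices, and a hypothetical destination buffer dest that the results are XORed into (those names and the exact surrounding loop are assumptions, since only the packing snippet is shown):

    #include <stddef.h>
    #include <stdint.h>

    /* Scalar, unrolled by 4: the four table loads per block are independent,
       so the CPU can overlap them, and there is no _mm_set_epi32 packing cost. */
    void xor_exp_scalar(uint32_t *dest, const int *logSums,
                        const uint32_t *expTable, size_t n /* multiple of 4 */)
    {
        for (size_t i = 0; i < n; i += 4) {
            dest[i + 0] ^= expTable[logSums[i + 0]];
            dest[i + 1] ^= expTable[logSums[i + 1]];
            dest[i + 2] ^= expTable[logSums[i + 2]];
            dest[i + 3] ^= expTable[logSums[i + 3]];
        }
    }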
If you want to use SSE4.1, you can try this. It may be faster; it may not be:
    int* logSumArray = (int*)(&logSumVector);

    __m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 2);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 3);
I also suggest unrolling your loop by at least 4 iterations and interleaving all the instructions, to give this code any chance of performing well.
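A rough sketch of what that interleaving could look like, assuming SSE4.1, the same expTable, and hypothetical names for the index and destination arrays (here two independent vectors are built side by side so loads and inserts from different iterations can overlap; the same pattern extends to a full 4x unroll):

    #include <smmintrin.h>  /* SSE4.1 */
    #include <stddef.h>
    #include <stdint.h>

    void xor_exp_sse41(__m128i *dest, const int *logSums,
                       const uint32_t *expTable, size_t nVectors /* even */)
    {
        for (size_t i = 0; i < nVectors; i += 2) {
            const int *a = logSums + 4 * i;   /* indices for vector i     */
            const int *b = a + 4;             /* indices for vector i + 1 */

            /* Interleave the two chains so they execute in parallel. */
            __m128i va = _mm_cvtsi32_si128((int)expTable[a[0]]);
            __m128i vb = _mm_cvtsi32_si128((int)expTable[b[0]]);
            va = _mm_insert_epi32(va, (int)expTable[a[1]], 1);
            vb = _mm_insert_epi32(vb, (int)expTable[b[1]], 1);
            va = _mm_insert_epi32(va, (int)expTable[a[2]], 2);
            vb = _mm_insert_epi32(vb, (int)expTable[b[2]], 2);
            va = _mm_insert_epi32(va, (int)expTable[a[3]], 3);
            vb = _mm_insert_epi32(vb, (int)expTable[b[3]], 3);

            dest[i]     = _mm_xor_si128(dest[i],     va);
            dest[i + 1] = _mm_xor_si128(dest[i + 1], vb);
        }
    }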
What you really need are Intel's AVX2 gather/scatter instructions. But those are still a few years down the road...
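For reference, once AVX2 is available the whole store-and-repack dance collapses into a single gather. A hedged sketch, assuming expTable holds 32-bit entries and the index vector is the same logSumVector as above:

    #include <immintrin.h>  /* AVX2 */

    __m128i gather_exp(const int *expTable, __m128i logSumVector)
    {
        /* One instruction replaces the store + 4 loads + 3 inserts;
           scale 4 because expTable entries are 4 bytes wide. */
        return _mm_i32gather_epi32(expTable, logSumVector, 4);
    }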
Mysticial