I have vectorized the following loop that appears in the application that I am developing:
    void vecScl(Node** A, Node* B, long val){
        int fact = round( dot / const );
        for(int i=0; i<SIZE; i++)
            (*A)->vector[i] -= fact * B->vector[i];
    }
And this is the SSE code:
    void vecSclSSE(Node** A, Node* B, long val){
        int fact = round( dot / const );
        __m128i vecPi, vecQi, vecCi, vecQCi, vecResi;

        int sseBound = SIZE/4;

        for(int i=0, j=0; j<sseBound; i+=4, j++){
            vecPi   = _mm_loadu_si128((__m128i *)&((*A)->vector)[i]);
            vecQi   = _mm_set_epi32(fact, fact, fact, fact);
            vecCi   = _mm_loadu_si128((__m128i *)&((B)->vector)[i]);
            vecQCi  = _mm_mullo_epi32(vecQi, vecCi);
            vecResi = _mm_sub_epi32(vecPi, vecQCi);
            _mm_storeu_si128((__m128i *)(((*A)->vector) + i), vecResi);
        }
    }
Although this works correctly, the performance is exactly the same as without SSE. I am compiling the code with:
g++ *.cpp *.h -msse4.1 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -Warray-bounds -O2
Is this because I don't allocate aligned memory (and use the corresponding aligned SSE intrinsics)? The code is very complicated to modify, so I have avoided that for now.
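For reference, here is a minimal sketch of what an aligned-allocation path could look like; the helper name allocVector16 and the assumption that vector holds 32-bit ints are mine, not from the real code. With 16-byte-aligned storage the aligned load/store forms become legal:

    #include <cstddef>
    #include <cstdlib>
    #include <emmintrin.h>

    // Hypothetical helper (not from the original code): allocate n ints on a 16-byte boundary.
    int* allocVector16(std::size_t n) {
        void* p = nullptr;
        if (posix_memalign(&p, 16, n * sizeof(int)) != 0)
            return nullptr;              // allocation failed
        return static_cast<int*>(p);     // release with free()
    }

    // With 16-byte-aligned storage the aligned forms can replace the unaligned ones:
    //   __m128i v = _mm_load_si128((const __m128i*)(ptr + i));
    //   _mm_store_si128((__m128i*)(ptr + i), v);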
By the way, in terms of further improvements, and given that I am limited to the Sandy Bridge architecture, what else can I do?
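As a point of reference for the improvements question, this is a minimal sketch with the fact broadcast hoisted out of the loop and a scalar tail for the SIZE % 4 leftover elements; it assumes vector holds 32-bit ints, and the function name and signature are hypothetical:

    #include <smmintrin.h>   // SSE4.1: _mm_mullo_epi32

    // Hypothetical signature: 'fact' passed in directly, 'n' is the element count.
    void vecSclSSE_v2(int* a, const int* b, int fact, int n) {
        const __m128i vFact = _mm_set1_epi32(fact);   // broadcast once, outside the loop
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
            va = _mm_sub_epi32(va, _mm_mullo_epi32(vFact, vb));
            _mm_storeu_si128((__m128i*)(a + i), va);
        }
        for (; i < n; ++i)                             // scalar tail when n is not a multiple of 4
            a[i] -= fact * b[i];
    }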
EDIT: The compiler has not auto-vectorized the scalar code. First, I changed the data type of the vectors to short, which did not change the performance. Then I compiled with -fno-tree-vectorize, and the performance is still the same.
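For the shorts experiment, a rough sketch of what a 16-bit SSE kernel could look like (8 elements per 128-bit register, SSE2 only); it assumes the values and products fit in 16 bits, and the name and signature are hypothetical:

    #include <emmintrin.h>   // SSE2

    // Hypothetical 16-bit variant: 8 shorts per 128-bit register.
    void vecSclSSE_i16(short* a, const short* b, short fact, int n) {
        const __m128i vFact = _mm_set1_epi16(fact);
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
            // _mm_mullo_epi16 keeps the low 16 bits of each product
            va = _mm_sub_epi16(va, _mm_mullo_epi16(vFact, vb));
            _mm_storeu_si128((__m128i*)(a + i), va);
        }
        for (; i < n; ++i)
            a[i] = (short)(a[i] - fact * b[i]);
    }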
Thank you so much
c++ performance vectorization sse
a3mlord, Mar 19 '14 at 15:27