Performance is the same with SSE

I have vectorized the following loop that appears in the application that I am developing:

 void vecScl(Node** A, Node* B, long val){
     int fact = round( dot / const );
     for(int i = 0; i < SIZE; i++)
         (*A)->vector[i] -= fact * B->vector[i];
 }

And this is the SSE code:

 void vecSclSSE(Node** A, Node* B, long val){
     int fact = round( dot / const );
     __m128i vecPi, vecQi, vecCi, vecQCi, vecResi;
     int sseBound = SIZE/4;
     int i, j;
     for(i = 0, j = 0; j < sseBound; i += 4, j++){
         vecPi  = _mm_loadu_si128( (__m128i *)&((*A)->vector)[i] );
         vecQi  = _mm_set_epi32(fact, fact, fact, fact);
         vecCi  = _mm_loadu_si128( (__m128i *)&((B)->vector)[i] );
         vecQCi = _mm_mullo_epi32(vecQi, vecCi);
         vecResi = _mm_sub_epi32(vecPi, vecQCi);
         _mm_storeu_si128( (__m128i *)(((*A)->vector) + i), vecResi );
     }
     // Compute remaining positions if SIZE % 4 != 0
     for(; i < SIZE; i++)
         (*A)->vector[i] -= fact * B->vector[i];
 }

Although this works in terms of correctness, the performance is exactly the same as without SSE. I am compiling the code with:

  g++ *.cpp *.h -msse4.1 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -Warray-bounds -O2 

Is this because I don't allocate aligned memory (and use the corresponding SSE intrinsics)? The code is very complicated to modify, so I have avoided this for now.

By the way, in terms of further improvements, and given that I am limited to the Sandy Bridge architecture, what can I do?

EDIT: The compiler is not auto-vectorizing the scalar code. First, I changed the data type of the vectors to short, which did not change the performance. Then I compiled with -fno-tree-vectorize, and the performance was the same.

Thank you so much

+4
c++ performance vectorization sse
Mar 19 '14 at 15:27
3 answers

If your data is large, you may simply be memory bandwidth bound, since you perform very few ALU operations per load/store.

However, there are a few minor improvements you can try:

 inline void vecSclSSE(Node** A, Node* B, long val)   // make function inline, for cases where `val` is small
 {
     const int fact = (dot + const / 2 - 1) / const;  // use integer arithmetic here if possible
     const __m128i vecQi = _mm_set1_epi32(fact);      // hoist constant initialisation out of loop
     int32_t * const pA = (*A)->vector;               // hoist invariant de-references out of loop
     int32_t * const pB = B->vector;
     __m128i vecPi, vecCi, vecQCi, vecResi;
     int i;
     for(i = 0; i < SIZE - 3; i += 4)                 // use one loop variable
     {
         vecPi  = _mm_loadu_si128((__m128i *)&(pA[i]));
         vecCi  = _mm_loadu_si128((__m128i *)&(pB[i]));
         vecQCi = _mm_mullo_epi32(vecQi, vecCi);
         vecResi = _mm_sub_epi32(vecPi, vecQCi);
         _mm_storeu_si128((__m128i *)&(pA[i]), vecResi);
     }
     // Compute remaining positions if SIZE % 4 != 0
     for( ; i < SIZE; i++)
         pA[i] -= fact * pB[i];
+3
Mar 19 '14 at 15:49

As Paul said, you perform relatively little computation per data access, and your code is probably memory bound. Since unaligned stores/loads are slower than aligned ones, you really should align your data.

You should align to 16 bytes for SSE and (I think) 32 bytes for AVX. If you allocate your data yourself, just use _aligned_malloc (or posix_memalign). If you use std::vector, the easiest way to get alignment is to replace the native std::allocator with a custom aligned allocator. That allocator would call _aligned_malloc or something like it instead of malloc/new. See also this question.

And then you can switch to the aligned instructions for loading / storing.
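To illustrate, here is one possible shape for the aligned variant (a sketch under my own assumptions: the buffers are raw int32_t arrays obtained from an aligned allocator such as _mm_malloc, and _mm_mullo_epi32 requires SSE4.1):

```cpp
#include <immintrin.h>  // intrinsics, plus _mm_malloc / _mm_free
#include <cstdint>

// Same kernel as vecSclSSE, but with aligned loads/stores. The target
// attribute lets _mm_mullo_epi32 compile without -msse4.1 (GCC/Clang).
__attribute__((target("sse4.1")))
void vecSclAligned(int32_t* pA, const int32_t* pB, int fact, int size) {
    const __m128i vecQi = _mm_set1_epi32(fact);
    int i = 0;
    for (; i + 4 <= size; i += 4) {
        __m128i vecPi = _mm_load_si128((const __m128i*)&pA[i]);  // aligned load
        __m128i vecCi = _mm_load_si128((const __m128i*)&pB[i]);
        _mm_store_si128((__m128i*)&pA[i],                        // aligned store
                        _mm_sub_epi32(vecPi, _mm_mullo_epi32(vecQi, vecCi)));
    }
    for (; i < size; i++)   // scalar tail
        pA[i] -= fact * pB[i];
}
```

The buffers would then come from something like `int32_t* v = (int32_t*)_mm_malloc(SIZE * sizeof(int32_t), 16);`, freed with `_mm_free(v)`.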

In addition, I'm not sure what code is generated for &((*A)->vector)[i]; it is better to hoist the data pointer into a local variable, and be sure to qualify it with __restrict.
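As a minimal sketch of that suggestion (the Node layout, SIZE, and names here are my assumptions, not the asker's actual definitions):

```cpp
#include <cstdint>

// Assumed shapes; the real definitions live in the asker's application.
static const int SIZE = 8;
struct Node { int32_t* vector; };

// __restrict promises the compiler that pA and pB never alias, so it can
// keep values in registers and potentially auto-vectorize the loop.
void vecSclPtr(Node** A, Node* B, int fact) {
    int32_t* __restrict pA = (*A)->vector;        // hoisted de-reference
    const int32_t* __restrict pB = B->vector;
    for (int i = 0; i < SIZE; ++i)
        pA[i] -= fact * pB[i];
}
```

`__restrict` is a g++/MSVC extension (g++ also accepts `__restrict__`); the standard C keyword `restrict` is not available in C++.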

But before delving into all of this, make sure it is worth your time and the maintenance burden. You can profile with OProfile on Linux or AMD CodeAnalyst on Windows.

+1
Mar 19 '14 at 16:01

I would like to add that, for the same SIZE, I was able to vectorize the kernel that runs right before the one in the first message. This time I got large speedups (I won't quote numbers, since they are meaningless without quantifying the time spent in that kernel relative to the whole application). The kernel computes the dot product of two vectors, i.e.:

 for(i = 0; i < SIZE; i++)
     dot += A->vector[i] * B->vector[i];
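For reference, that scalar loop can be written with SSE intrinsics along these lines (a sketch: it assumes 32-bit elements and unaligned data, _mm_mullo_epi32 again needs SSE4.1, and the horizontal sum at the end is one of several equivalent ways):

```cpp
#include <immintrin.h>  // SSE intrinsics
#include <cstdint>

// Accumulates four partial sums in one register, then reduces them.
// The target attribute lets this compile without -msse4.1 (GCC/Clang).
__attribute__((target("sse4.1")))
int32_t dotSSE(const int32_t* a, const int32_t* b, int n) {
    __m128i acc = _mm_setzero_si128();           // four partial sums
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i*)&a[i]);
        __m128i vb = _mm_loadu_si128((const __m128i*)&b[i]);
        acc = _mm_add_epi32(acc, _mm_mullo_epi32(va, vb));
    }
    // Horizontal sum of the four 32-bit lanes.
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    int32_t dot = _mm_cvtsi128_si32(acc);
    for (; i < n; i++)                           // scalar tail
        dot += a[i] * b[i];
    return dot;
}
```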

From this I conclude that SIZE is not the problem. It suggests, in turn, that I may be doing something wrong in the first kernel. Can someone suggest a different set of SSE operations for the first kernel? I think it's worth a try. The next step is to allocate aligned memory, but as mentioned earlier, this is not critical on Sandy Bridge and other recent architectures.

This confirmed once again that the compiler was not vectorizing the code.

Thanks

-1
Mar 20 '14 at 12:14


