Saturated Subtraction - AVX or SSE4.2

I am improving the performance of a C program, and I can't get the execution time down any further; the most expensive loop is the bottleneck.

I need to subtract 1 from each element of an unsigned long int array if the element is greater than zero.

The loop:

unsigned long int * WorkerDataTime;
...
for (WorkerID = 0; WorkerID < WorkersON; ++WorkerID) {
    if (WorkerDataTime[WorkerID] > 0)
        WorkerDataTime[WorkerID] -= 1;
}

I also tried this:

for (WorkerID = 0; WorkerID < WorkersON; ++WorkerID) {
    int rest = WorkerDataTime[WorkerID] > 0;
    WorkerDataTime[WorkerID] = WorkerDataTime[WorkerID] - rest;
}

But the runtime is similar.

QUESTION: Is there an intrinsic instruction (SSE4.2, AVX ...) to do this directly? (I am using gcc 4.8.2.)

I know this is possible with char or short elements (_mm_subs_epi8 and _mm_subs_epi16), and I can't use AVX2.
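For reference, this is roughly what the 8-bit case looks like with the unsigned saturating variant _mm_subs_epu8 (a minimal sketch; the helper name and the aligned, multiple-of-16 length assumptions are mine). There is no 64-bit counterpart of these instructions, which is exactly what I am missing:

 #include <stddef.h>
 #include <emmintrin.h>   /* SSE2 */

 /* Sketch: saturating decrement of unsigned 8-bit elements.
    Assumes data is 16-byte aligned and count is a multiple of 16. */
 static void clamped_decrement_u8(unsigned char *data, size_t count)
 {
     size_t i;
     __m128i one = _mm_set1_epi8(1);
     for (i = 0; i < count; i += 16) {
         __m128i v = _mm_load_si128((__m128i *)(data + i));
         v = _mm_subs_epu8(v, one);   /* PSUBUSB: clamps at 0 instead of wrapping */
         _mm_store_si128((__m128i *)(data + i), v);
     }
 }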

Thanks.

optimization c gcc sse avx
2 answers

With SSE4 you can do this with three instructions per vector. Here is code that processes the entire array, decrementing all unsigned integers that are non-zero:

void clampedDecrement_SSE (__m128i * data, size_t count)
{
    // processes 2 elements per iteration, no alignment checks done.
    // count must be a multiple of 2.
    size_t i;
    count /= 2;

    __m128i zero = _mm_set1_epi32(0);
    __m128i ones = _mm_set1_epi32(~0);

    for (i = 0; i < count; i++)
    {
        __m128i values, mask;

        // load 2 64-bit integers:
        values = _mm_load_si128(data);

        // compare against zero. Gives either 0 or ~0 (on match):
        mask = _mm_cmpeq_epi64(values, zero);

        // invert the mask. Yields -1 for all non-zero elements, 0 otherwise:
        mask = _mm_xor_si128(mask, ones);

        // now just add the mask for a saturated unsigned decrement:
        values = _mm_add_epi64(values, mask);

        // and store the result back to memory:
        _mm_store_si128(data, values);
        data++;
    }
}
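A possible way to call this from the original code is a sketch like the following (the wrapper name is mine; it assumes WorkerDataTime is 16-byte aligned, e.g. allocated with _mm_malloc, and that unsigned long int is 64 bits): run the vector loop over the largest even prefix and finish a possible leftover element in scalar code.

 #include <stddef.h>
 #include <emmintrin.h>

 void decrementAll(unsigned long int *WorkerDataTime, size_t WorkersON)
 {
     size_t i;
     size_t vec = WorkersON & ~(size_t)1;     /* largest even element count */
     clampedDecrement_SSE((__m128i *)WorkerDataTime, vec);
     for (i = vec; i < WorkersON; ++i)        /* at most one remaining element */
         if (WorkerDataTime[i] > 0)
             WorkerDataTime[i] -= 1;
 }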

With AVX2, we can improve on this and process 4 elements at a time:

void clampedDecrement (__m256i * data, size_t count)
{
    // processes 4 elements per iteration, no alignment checks done.
    // count must be a multiple of 4.
    size_t i;
    count /= 4;

    // we need some constants:
    __m256i zero = _mm256_set1_epi32(0);
    __m256i ones = _mm256_set1_epi32(~0);

    for (i = 0; i < count; i++)
    {
        __m256i values, mask;

        // load 4 64-bit integers:
        values = _mm256_load_si256(data);

        // compare against zero. Gives either 0 or ~0 (on match):
        mask = _mm256_cmpeq_epi64(values, zero);

        // invert the mask. Yields -1 for all non-zero elements, 0 otherwise:
        mask = _mm256_xor_si256(mask, ones);

        // now just add the mask for a saturated unsigned decrement:
        values = _mm256_add_epi64(values, mask);

        // and store the result back to memory:
        _mm256_store_si256(data, values);
        data++;
    }
}

EDIT: Added SSE code version.


If your processor does not have XOP, then there is no efficient way to compare 64-bit unsigned integers.

I pulled the following from Agner Fog's Vector Class Library. It shows how to compare unsigned 64-bit integers.

static inline Vec2qb operator > (Vec2uq const & a, Vec2uq const & b) {
#ifdef __XOP__  // AMD XOP instruction set
    return Vec2q(_mm_comgt_epu64(a, b));
#else  // SSE2 instruction set
    __m128i sign32  = _mm_set1_epi32(0x80000000);       // sign bit of each dword
    __m128i aflip   = _mm_xor_si128(a, sign32);         // a with sign bits flipped
    __m128i bflip   = _mm_xor_si128(b, sign32);         // b with sign bits flipped
    __m128i equal   = _mm_cmpeq_epi32(a, b);            // a == b, dwords
    __m128i bigger  = _mm_cmpgt_epi32(aflip, bflip);    // a > b, dwords
    __m128i biggerl = _mm_shuffle_epi32(bigger, 0xA0);  // a > b, low dwords copied to high dwords
    __m128i eqbig   = _mm_and_si128(equal, biggerl);    // high part equal and low part bigger
    __m128i hibig   = _mm_or_si128(bigger, eqbig);      // high part bigger, or high part equal and low part bigger
    __m128i big     = _mm_shuffle_epi32(hibig, 0xF5);   // result copied to low part
    return Vec2qb(Vec2q(big));
#endif
}

So, if your processor supports XOP, you should try compiling with -mxop and see if the loop is vectorized.

Edit: if GCC does not vectorize the loop the way you want and your processor has XOP, you can do this:

for (WorkerID = 0; WorkerID < WorkersON - 1; WorkerID += 2) {
    __m128i v   = _mm_loadu_si128((__m128i*)&WorkerDataTime[WorkerID]);
    __m128i cmp = _mm_comgt_epu64(v, _mm_setzero_si128());
    v = _mm_add_epi64(v, cmp);
    _mm_storeu_si128((__m128i*)&WorkerDataTime[WorkerID], v);
}
for (; WorkerID < WorkersON; ++WorkerID) {
    if (WorkerDataTime[WorkerID] > 0)
        WorkerDataTime[WorkerID] -= 1;
}

Compile with -mxop and add #include <x86intrin.h>.

Edit: as Nils Pipbenbrink pointed out, if you do not have XOP you can do this with one extra instruction, using _mm_xor_si128:

for (WorkerID = 0; WorkerID < WorkersON - 1; WorkerID += 2) {
    __m128i v    = _mm_loadu_si128((__m128i*)&WorkerDataTime[WorkerID]);
    __m128i mask = _mm_cmpeq_epi64(v, _mm_setzero_si128());
    mask = _mm_xor_si128(mask, _mm_set1_epi32(~0));
    v = _mm_add_epi64(v, mask);
    _mm_storeu_si128((__m128i*)&WorkerDataTime[WorkerID], v);
}
for (; WorkerID < WorkersON; ++WorkerID) {
    if (WorkerDataTime[WorkerID] > 0)
        WorkerDataTime[WorkerID] -= 1;
}

Edit: Based on Stephen Canon's comment, I learned that there is a more efficient way to compare general 64-bit unsigned integers using pcmpgtq from SSE4.2:

__m128i a, b;
__m128i sign64 = _mm_set1_epi64x(0x8000000000000000L);
__m128i aflip  = _mm_xor_si128(a, sign64);
__m128i bflip  = _mm_xor_si128(b, sign64);
__m128i cmp    = _mm_cmpgt_epi64(aflip, bflip);
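Applied to the loop in the question, a sketch using this trick (same variable names as the question; compile with -msse4.2 and include <x86intrin.h>) could look like this: flipping the sign bit of each element and comparing against a sign-flipped zero yields -1 exactly where the unsigned value is greater than zero, and adding that mask performs the decrement.

 __m128i sign64 = _mm_set1_epi64x(0x8000000000000000L);
 for (WorkerID = 0; WorkerID + 2 <= WorkersON; WorkerID += 2) {
     __m128i v = _mm_loadu_si128((__m128i*)&WorkerDataTime[WorkerID]);
     /* unsigned (v > 0)  <=>  signed ((v ^ sign64) > (0 ^ sign64)) */
     __m128i gt = _mm_cmpgt_epi64(_mm_xor_si128(v, sign64), sign64);
     v = _mm_add_epi64(v, gt);    /* gt is -1 where v > 0, so this subtracts 1 */
     _mm_storeu_si128((__m128i*)&WorkerDataTime[WorkerID], v);
 }
 for (; WorkerID < WorkersON; ++WorkerID)
     if (WorkerDataTime[WorkerID] > 0)
         WorkerDataTime[WorkerID] -= 1;

For a comparison against zero this is not an improvement over the cmpeq/xor version above; it mainly illustrates the general unsigned 64-bit compare.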
