How can I add two SSE registers

I have two SSE registers (128 bits each), and I want to add them. I know how to add the corresponding words within them, for example with _mm_add_epi16 if the registers hold 16-bit words, but what I want is something like _mm_add_epi128 (which does not exist), which would treat each register as one big 128-bit word. Is there a way to perform this operation, even if multiple instructions are required?
I thought about using _mm_add_epi64 , detecting overflow in the low word and then adding 1 to the high word of the register if necessary, but I would also like this approach to work for 256-bit registers (AVX2), and there it seems too complicated.
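
In scalar code the idea would look something like this (a plain C sketch, not SIMD; the names are just for illustration):

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128;

    // add the halves, then add 1 to the high half if the low add wrapped around
    u128 add_u128(u128 a, u128 b) {
        u128 r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo); // r.lo < a.lo exactly when the low add overflowed
        return r;
    }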

Tags: c++, c, sse, intel, avx2
1 answer

To add two 128-bit numbers x and y to give z using SSE, you can do it like this:

    z = _mm_add_epi64(x,y);   // add the two 64-bit halves independently
    c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
                              // carry mask (0 or -1) from the low half, moved into the high lane
    z = _mm_sub_epi64(z,c);   // subtracting -1 adds the carry of 1 to the high half

This is based on this link: how-can-i-add-and-subtract-128-bit-integers-in-c-or-c .
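
The carry detection relies on a simple identity: for unsigned 64-bit addition z = x + y (mod 2^64), a carry out occurred exactly when z < x (unsigned), since the sum can only be smaller than an operand if it wrapped. A tiny standalone check of that identity:

    #include <assert.h>
    #include <stdint.h>

    int main() {
        uint64_t x = 0xffffffffffffffffULL, y = 2;
        uint64_t z = x + y;   // wraps around to 1
        assert(z < x);        // wrapped, so there was a carry out
        uint64_t a = 3, b = 4, s = a + b;
        assert(!(s < a));     // no wrap, no carry
        return 0;
    }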

The unsigned_lessthan function is defined below. This is complicated without AMD XOP (I actually found a simpler version for SSE4.2 to use when XOP is not available; see the end of my answer). Probably someone can suggest a better method. Here is some code that shows that this works.

    #include <stdint.h>
    #include <x86intrin.h>
    #include <stdio.h>

    // returns a < b (unsigned, per 64-bit lane), computed as b > a;
    // note the operands are swapped relative to the variable names below
    static inline __m128i unsigned_lessthan(__m128i a, __m128i b) {
    #ifdef __XOP__  // AMD XOP instruction set
        return _mm_comgt_epu64(b,a);
    #else           // SSE2 instruction set
        __m128i sign32  = _mm_set1_epi32(0x80000000);     // sign bit of each dword
        __m128i aflip   = _mm_xor_si128(b,sign32);        // b with sign bits flipped
        __m128i bflip   = _mm_xor_si128(a,sign32);        // a with sign bits flipped
        __m128i equal   = _mm_cmpeq_epi32(b,a);           // b == a, dwords
        __m128i bigger  = _mm_cmpgt_epi32(aflip,bflip);   // b > a, dwords, unsigned via the sign flip
        __m128i biggerl = _mm_shuffle_epi32(bigger,0xA0); // b > a, low dwords copied to high dwords
        __m128i eqbig   = _mm_and_si128(equal,biggerl);   // high dwords equal and low dwords bigger
        __m128i hibig   = _mm_or_si128(bigger,eqbig);     // high dwords bigger, or equal with low dwords bigger
        __m128i big     = _mm_shuffle_epi32(hibig,0xF5);  // 64-bit result copied to both dwords of each lane
        return big;
    #endif
    }

    int main() {
        __m128i x,y,z,c;
        x = _mm_set_epi64x(3,0xffffffffffffffffll);
        y = _mm_set_epi64x(1,0x2ll);
        z = _mm_add_epi64(x,y);
        c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
        z = _mm_sub_epi64(z,c);
        int out[4]; //int64_t out[2];
        _mm_storeu_si128((__m128i*)out, z);
        printf("%d %d\n", out[2], out[0]); // prints "5 1": the low half wrapped to 1, the high half got the carry
    }

Edit:

The only potentially efficient way to add 128-bit or 256-bit numbers with SSE is with XOP. The only option with AVX would be XOP2, which does not exist yet. And even if you have XOP, it may only be useful for adding two 128-bit numbers in parallel (you could do four with AVX if XOP2 existed), to avoid horizontal instructions such as _mm_unpacklo_epi64 .

The best solution, in general, is to push the registers onto the stack and use scalar arithmetic. Assuming you have two 256-bit registers x4 and y4, you can add them like this:

    #include <stdint.h>
    #include <stdio.h>
    #include <immintrin.h>

    // usage: spill the registers to memory, add in scalar code, reload
    __m256i x4, y4, z4;
    uint64_t x[4], y[4], z[4];
    _mm256_storeu_si256((__m256i*)x, x4);
    _mm256_storeu_si256((__m256i*)y, y4);
    add_u256(x,y,z);
    z4 = _mm256_loadu_si256((__m256i*)z);

    void add_u256(uint64_t x[4], uint64_t y[4], uint64_t z[4]) {
        uint64_t c1 = 0, c2 = 0, tmp;

        //add low 128-bits
        z[0] = x[0] + y[0];
        z[1] = x[1] + y[1];
        c1 += z[1]<x[1];    // carry out of word 1
        tmp = z[1];
        z[1] += z[0]<x[0];  // propagate the carry out of word 0 into word 1
        c1 += z[1]<tmp;     // that propagation can itself carry out

        //add high 128-bits + carry from low 128-bits
        z[2] = x[2] + y[2];
        c2 += z[2]<x[2];
        tmp = z[2];
        z[2] += c1;         // add the carry out of the low 128 bits
        c2 += z[2]<tmp;
        z[3] = x[3] + y[3] + c2;
    }

    int main() {
        uint64_t x[4], y[4], z[4];
        x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
        y[0] = 1;  y[1] = 1;  y[2] = 1; y[3] = 1;
        //z = x + y: (z3,z2,z1,z0) = (2,3,1,0)
        //x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
        //y[0] = 1;  y[1] = 0;  y[2] = 1; y[3] = 1;
        //z = x + y: (z3,z2,z1,z0) = (2,3,0,0)
        add_u256(x,y,z);
        for(int i=3; i>=0; i--) printf("%llu ", (unsigned long long)z[i]);
        printf("\n");
    }
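
If your compiler provides the _addcarry_u64 intrinsic from <immintrin.h> (GCC, Clang and MSVC do on x86-64), the same carry chain can be written more directly; a sketch of that alternative (the name add_u256_adc is just for illustration), which typically compiles down to an add followed by adc instructions:

    #include <immintrin.h>
    #include <stdint.h>

    // same result as add_u256 above, using the compiler's add-with-carry intrinsic
    static void add_u256_adc(const uint64_t x[4], const uint64_t y[4], uint64_t z[4]) {
        unsigned long long r0, r1, r2, r3;
        unsigned char c = 0;
        c = _addcarry_u64(c, x[0], y[0], &r0);   // word 0, carry out in c
        c = _addcarry_u64(c, x[1], y[1], &r1);   // each step consumes the previous carry
        c = _addcarry_u64(c, x[2], y[2], &r2);
        (void)_addcarry_u64(c, x[3], y[3], &r3); // final carry out discarded (mod 2^256)
        z[0] = r0; z[1] = r1; z[2] = r2; z[3] = r3;
    }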

Edit: based on Stephen Canon's comment at saturated-substraction-avx-or-sse4-2 , I found that there is a more efficient way to compare unsigned 64-bit numbers with SSE4.2 when XOP is not available.

    __m128i a,b;
    __m128i sign64 = _mm_set1_epi64x(0x8000000000000000LL);
    __m128i aflip  = _mm_xor_si128(a, sign64);      // flipping the sign bits maps
    __m128i bflip  = _mm_xor_si128(b, sign64);      // unsigned order onto signed order
    __m128i cmp    = _mm_cmpgt_epi64(aflip, bflip); // a > b unsigned; _mm_cmpgt_epi64 needs SSE4.2
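
Wrapped up as a drop-in replacement for the SSE2 branch of unsigned_lessthan above (the name unsigned_lessthan_sse42 is just for illustration):

    static inline __m128i unsigned_lessthan_sse42(__m128i a, __m128i b) {
        __m128i sign64 = _mm_set1_epi64x(0x8000000000000000LL);
        // unsigned a < b is equivalent to signed (b ^ sign64) > (a ^ sign64)
        return _mm_cmpgt_epi64(_mm_xor_si128(b, sign64),
                               _mm_xor_si128(a, sign64));
    }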
