To add two 128-bit numbers x and y to give z using SSE, you can do it like this:
    z = _mm_add_epi64(x,y);
    c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
    z = _mm_sub_epi64(z,c);
This is based on the question how-can-i-add-and-subtract-128-bit-integers-in-c-or-c.
The unsigned_lessthan function is defined below. It is complicated without AMD XOP (I later found a simpler version for SSE4.2 if XOP is not available; see the end of my answer). Other people may well suggest a better method. Here is some code that shows that this works.
#include <stdint.h>
Edit:
The only potentially efficient way to add 128-bit or 256-bit numbers with SSE is through XOP. The only option with AVX is XOP2, which is not yet available. And even if you have XOP, it may only be useful when adding two 128-bit or 256-bit numbers in parallel (you could do four with AVX if XOP2 existed), so as to avoid horizontal instructions like _mm_unpacklo_epi64.
The best solution in the general case is to push the registers onto the stack and use scalar arithmetic. Assuming you have two 256-bit registers x4 and y4, you can add them like this:
    __m256i x4, y4, z4;
    uint64_t x[4], y[4], z[4];
    _mm256_storeu_si256((__m256i*)x, x4);
    _mm256_storeu_si256((__m256i*)y, y4);
    add_u256(x,y,z);
    z4 = _mm256_loadu_si256((__m256i*)z);

    void add_u256(uint64_t x[4], uint64_t y[4], uint64_t z[4]) {
        uint64_t c1 = 0, c2 = 0, tmp;
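The listing above breaks off inside add_u256. As a minimal sketch of what such a scalar routine can look like (not necessarily the original body), the function below adds two 256-bit values limb by limb and propagates the carry. The name add_u256_carry, the returned carry-out, and the assumption that index 0 holds the least significant 64 bits are all illustrative choices, not taken from the original code.

```c
#include <stdint.h>

/* Hypothetical helper: z = x + y for 256-bit integers stored as four
   64-bit limbs, limb 0 being the least significant (an assumption).
   Returns the carry out of the most significant limb (0 or 1). */
static uint64_t add_u256_carry(const uint64_t x[4], const uint64_t y[4],
                               uint64_t z[4])
{
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t sum = x[i] + y[i];   /* may wrap around modulo 2^64 */
        uint64_t c1  = sum < x[i];    /* 1 if x[i] + y[i] wrapped */
        uint64_t res = sum + carry;   /* add the carry from the previous limb */
        uint64_t c2  = res < sum;     /* 1 if adding that carry wrapped */
        z[i]  = res;
        carry = c1 | c2;              /* both cannot be 1 at the same time */
    }
    return carry;
}
```

The unsigned comparison `sum < x[i]` is the same carry test that unsigned_lessthan performs in the SSE version; in scalar code it needs no extra instructions beyond what the compiler emits for the comparison.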
Edit: based on Stephen Canon's comment on saturated-substraction-avx-or-sse4-2, I found that there is a more efficient way to compare 64-bit unsigned numbers with SSE4.2 if XOP is not available.
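    __m128i a,b;
    __m128i sign64 = _mm_set1_epi64x(0x8000000000000000L);
    // flipping the sign bit maps unsigned order onto signed order
    __m128i aflip = _mm_xor_si128(a, sign64);
    __m128i bflip = _mm_xor_si128(b, sign64);
    // each 64-bit lane is all ones where a > b (unsigned), zero otherwise
    __m128i cmp = _mm_cmpgt_epi64(aflip,bflip);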
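To tie this back to the 128-bit addition at the start of the answer, here is a minimal sketch that uses the sign-flip comparison as unsigned_lessthan. It assumes GCC or Clang (the `target("sse4.2")` attribute stands in for compiling with -msse4.2) and a CPU that supports SSE4.2. The names add_u128 and add_u128_words are hypothetical wrappers introduced only for illustration, and the low 64 bits are assumed to sit in lane 0.

```c
#include <stdint.h>
#include <nmmintrin.h>   /* SSE4.2, provides _mm_cmpgt_epi64 */

/* Per-lane mask: all ones where a < b as unsigned 64-bit integers. */
__attribute__((target("sse4.2")))
static __m128i unsigned_lessthan(__m128i a, __m128i b)
{
    const __m128i sign64 = _mm_set1_epi64x(INT64_MIN); /* only the top bit set */
    __m128i aflip = _mm_xor_si128(a, sign64);
    __m128i bflip = _mm_xor_si128(b, sign64);
    return _mm_cmpgt_epi64(bflip, aflip);              /* b > a means a < b */
}

/* Hypothetical wrapper: 128-bit add, lane 0 = low half, lane 1 = high half. */
__attribute__((target("sse4.2")))
static __m128i add_u128(__m128i x, __m128i y)
{
    __m128i z = _mm_add_epi64(x, y);                   /* lane-wise add, no carry */
    /* Move the low lane's carry mask (0 or -1) into the high lane. */
    __m128i c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z, x));
    return _mm_sub_epi64(z, c);                        /* subtracting -1 adds the carry */
}

/* Hypothetical convenience wrapper working on plain arrays (index 0 = low half). */
__attribute__((target("sse4.2")))
static void add_u128_words(const uint64_t x[2], const uint64_t y[2], uint64_t z[2])
{
    __m128i xv = _mm_loadu_si128((const __m128i *)x);
    __m128i yv = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)z, add_u128(xv, yv));
}
```

For example, adding {UINT64_MAX, 1} and {1, 2} through add_u128_words wraps the low half to 0 and carries into the high half, giving {0, 4}.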