You can compare the two vectors and then extract a mask from the comparison result:
    __m128i vcmp = _mm_cmpeq_epi8(v0, v1);        // PCMPEQB
    uint16_t vmask = _mm_movemask_epi8(vcmp);     // PMOVMSKB
    if (vmask == 0xffff)
    {
        // v0 == v1
    }
This works with SSE2 and later.
As @Zboson noted, if you have SSE 4.1 you can do it like this instead, which may be slightly more efficient, since it needs only two SSE instructions followed by a test of a flag (ZF):
    __m128i vcmp = _mm_xor_si128(v0, v1);         // PXOR
    if (_mm_testz_si128(vcmp, vcmp))              // PTEST (requires SSE 4.1)
    {
        // v0 == v1
    }
FWIW I just benchmarked both of these implementations on a Haswell Core i7, using clang to compile the test harness, and the timing results were very similar: the SSE4 implementation appears to be very slightly faster, but the difference is hard to measure.
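If it helps, here is a minimal sketch of both checks wrapped as functions so they can be dropped into a test file. The names `vec_equal_sse2` and `vec_equal_sse41` are my own (hypothetical, not from any library), and the `target("sse4.1")` attribute is an assumption about GCC/clang that lets the second function compile without passing `-msse4.1` globally; with other compilers you would enable SSE 4.1 in whatever way they require.

```c
#include <emmintrin.h>  /* SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8, _mm_xor_si128 */
#include <smmintrin.h>  /* SSE 4.1: _mm_testz_si128 */
#include <stdbool.h>

/* Hypothetical helper: SSE2-only equality test for two 128-bit vectors. */
static bool vec_equal_sse2(__m128i v0, __m128i v1)
{
    /* PCMPEQB: each byte of vcmp becomes 0xFF where v0 and v1 match, else 0x00 */
    __m128i vcmp = _mm_cmpeq_epi8(v0, v1);
    /* PMOVMSKB: gather the top bit of each of the 16 bytes into bits 0..15 */
    return _mm_movemask_epi8(vcmp) == 0xffff;
}

/* Hypothetical helper: SSE 4.1 equality test using PXOR + PTEST.
   Assumption: on GCC/clang the target attribute enables SSE 4.1 for this
   function only, when the file is not already built with -msse4.1. */
#if defined(__GNUC__) && !defined(__SSE4_1__)
__attribute__((target("sse4.1")))
#endif
static bool vec_equal_sse41(__m128i v0, __m128i v1)
{
    /* PXOR: the result is all zero bits exactly when v0 == v1 */
    __m128i vdiff = _mm_xor_si128(v0, v1);
    /* PTEST: returns 1 (ZF set) when (vdiff & vdiff) is all zeros */
    return _mm_testz_si128(vdiff, vdiff) != 0;
}
```

Both functions should return true for two vectors built with, say, `_mm_set1_epi8(42)`, and false as soon as any single bit differs. Note that the SSE 4.1 version still needs a CPU that supports SSE 4.1 at run time, whatever the compile flags are.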
Paul r