If you expect the number of non-zero elements to be very low (i.e. much less than 1%), you can simply check every 16-byte chunk for a non-zero value:

```
int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(reg, _mm_setzero_si128()));
if (mask != 65535) {
    //slow scalar code for the rare chunk that contains non-zero bytes
}
```

If the fraction of good elements is small enough, the cost of the mispredicted branches and of the slow scalar code inside the "if" will be negligible.
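Here is a minimal, untested sketch of this idea applied to the index-gathering task from your question (the function name and the tail handling are my own choices, not part of the original code):

```
#include <emmintrin.h>   //SSE2 intrinsics

int gather_indices_sparse(const char *src, int len, int *dst) {
    int *out = dst;
    for (int i = 0; i + 16 <= len; i += 16) {
        __m128i reg = _mm_loadu_si128((const __m128i*)&src[i]);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(reg, _mm_setzero_si128()));
        if (mask != 65535) {                //rare: this chunk contains a non-zero byte
            for (int j = 0; j < 16; j++)    //slow scalar path
                if (src[i + j] != 0)
                    *out++ = i + j;
        }
    }
    return out - dst;                       //the last len % 16 bytes are left to the caller
}
```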
For a good general solution, first consider implementing SSE stream compaction: it removes all zero bytes from a byte array (idea taken from here):
```
__m128i shuf[65536];   //must be precomputed
char cnt[65536];       //must be precomputed

int compress(const char *src, int len, char *dst) {
    char *ptr = dst;
    for (int i = 0; i < len; i += 16) {
        __m128i reg = _mm_load_si128((__m128i*)&src[i]);
        __m128i zeroMask = _mm_cmpeq_epi8(reg, _mm_setzero_si128());
        int mask = _mm_movemask_epi8(zeroMask);
        __m128i compressed = _mm_shuffle_epi8(reg, shuf[mask]);
        _mm_storeu_si128((__m128i*)ptr, compressed);
        ptr += cnt[mask];   //alternative: ptr += 16 - _mm_popcnt_u32(mask);
    }
    return ptr - dst;
}
```
As you can see, _mm_shuffle_epi8 plus a lookup table can work wonders. I don't know of any other way to vectorize structurally complex code such as stream compaction.
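The shuf and cnt tables above are only said to be precomputed. One possible way to fill them (my own untested sketch, not part of the original answer) relies on the fact that _mm_shuffle_epi8 writes a zero byte wherever the control byte has its high bit set:

```
void precompute_compress_tables() {
    for (int mask = 0; mask < 65536; mask++) {
        char ctrl[16];
        int k = 0;
        for (int j = 0; j < 16; j++)
            if (!(mask & (1 << j)))        //bit clear => source byte j is non-zero, keep it
                ctrl[k++] = (char)j;
        cnt[mask] = (char)k;               //number of bytes kept for this mask
        while (k < 16)
            ctrl[k++] = (char)0x80;        //high bit set => shuffle outputs zero here
        shuf[mask] = _mm_loadu_si128((const __m128i*)ctrl);
    }
}
```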
Now the only remaining problem is that your question asks for indices. Each index must be stored as a 4-byte value, so a block of 16 input bytes may produce up to 64 bytes of output, which does not fit into a single SSE register.
One way to handle this is to honestly unpack the output to 64 bytes. You replace reg in the code with the constant vector (0, 1, 2, ..., 15), then unpack the SSE register into four registers and add a register holding four copies of i. This requires many more instructions: 6 unpack instructions, 4 additions, and 3 extra stores (one store is already there); a sketch is given below. To me this is a lot of overhead, especially if you expect fewer than 25% non-zero elements.
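Here is an untested sketch of that variant (the function name is mine; it reuses the shuf and cnt tables from the first snippet and, like the snippets above, assumes len is a multiple of 16 and that dst has slack for a full 16-index store):

```
int compress_ids_full(const char *src, int len, int *dst) {
    int *ptr = dst;
    const __m128i identity = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    const __m128i zero = _mm_setzero_si128();
    for (int i = 0; i < len; i += 16) {
        __m128i reg = _mm_loadu_si128((const __m128i*)&src[i]);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(reg, zero));
        __m128i ids8 = _mm_shuffle_epi8(identity, shuf[mask]);   //compacted in-chunk offsets
        __m128i base = _mm_set1_epi32(i);                        //four copies of i
        __m128i lo16 = _mm_unpacklo_epi8(ids8, zero);            //6 unpacks: bytes -> dwords
        __m128i hi16 = _mm_unpackhi_epi8(ids8, zero);
        __m128i out0 = _mm_add_epi32(_mm_unpacklo_epi16(lo16, zero), base);   //4 additions
        __m128i out1 = _mm_add_epi32(_mm_unpackhi_epi16(lo16, zero), base);
        __m128i out2 = _mm_add_epi32(_mm_unpacklo_epi16(hi16, zero), base);
        __m128i out3 = _mm_add_epi32(_mm_unpackhi_epi16(hi16, zero), base);
        _mm_storeu_si128((__m128i*)(ptr + 0),  out0);            //4 stores (3 more than before)
        _mm_storeu_si128((__m128i*)(ptr + 4),  out1);
        _mm_storeu_si128((__m128i*)(ptr + 8),  out2);
        _mm_storeu_si128((__m128i*)(ptr + 12), out3);
        ptr += cnt[mask];                                        //only the first cnt[mask] indices are valid
    }
    return ptr - dst;
}
```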
Alternatively, you can limit the number of non-zero bytes processed per loop iteration to 4, so that a single register is always enough for the output. Here is sample code:
```
__m128i shufMask[65536];   //must be precomputed
char srcMove[65536];       //must be precomputed
char dstMove[65536];       //must be precomputed

int compress_ids(const char *src, int len, int *dst) {
    const char *ptrSrc = src;
    int *ptrDst = dst;
    __m128i offsets = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m128i base = _mm_setzero_si128();
    while (ptrSrc < src + len) {
        __m128i reg = _mm_loadu_si128((__m128i*)ptrSrc);
        __m128i zeroMask = _mm_cmpeq_epi8(reg, _mm_setzero_si128());
        int mask = _mm_movemask_epi8(zeroMask);
        __m128i ids8 = _mm_shuffle_epi8(offsets, shufMask[mask]);
        __m128i ids32 = _mm_unpacklo_epi16(_mm_unpacklo_epi8(ids8, _mm_setzero_si128()), _mm_setzero_si128());
        ids32 = _mm_add_epi32(ids32, base);
        _mm_storeu_si128((__m128i*)ptrDst, ids32);
        ptrDst += dstMove[mask];   //alternative: ptrDst += min(16 - _mm_popcnt_u32(mask), 4);
        ptrSrc += srcMove[mask];   //no alternative without LUT
        base = _mm_add_epi32(base, _mm_set1_epi32(srcMove[mask]));   //base tracks ptrSrc - src
    }
    return ptrDst - dst;
}
```
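As before, the three LUTs are left unspecified; one way they could be precomputed (my own untested sketch) is to collect at most 4 non-zero positions per 16-bit mask and record how far the source pointer may advance, so that unprocessed non-zero bytes are revisited on the next iteration:

```
void precompute_compress_ids_tables() {
    for (int mask = 0; mask < 65536; mask++) {
        char ctrl[16];
        int taken = 0, consumed = 0;
        for (int j = 0; j < 16 && taken < 4; j++) {
            if (!(mask & (1 << j)))
                ctrl[taken++] = (char)j;     //position of a non-zero byte
            consumed = j + 1;                //source bytes consumed so far
        }
        for (int k = taken; k < 16; k++)
            ctrl[k] = (char)0x80;            //unused lanes produce zeros
        shufMask[mask] = _mm_loadu_si128((const __m128i*)ctrl);
        srcMove[mask]  = (char)consumed;     //source bytes consumed this iteration
        dstMove[mask]  = (char)taken;        //indices emitted this iteration (0..4)
    }
}
```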
One drawback of this approach is that each loop iteration cannot start until the line ptrSrc += srcMove[mask]; from the previous iteration has executed, so the critical path has grown dramatically. Hardware hyper-threading, or emulating it manually by interleaving two independent input streams, can hide this penalty.
So, as you can see, there are many variations of this basic idea, which solve your problem with different degrees of efficiency. You can also reduce the size of the LUTs if you dislike them (again at the price of reduced throughput, e.g. by indexing with two 8-bit half-masks instead of one 16-bit mask).
This approach cannot be fully extended to wider registers (i.e. AVX2 and AVX-512), but you can try to merge the instructions of several consecutive iterations into single AVX2 or AVX-512 instructions, thereby slightly increasing throughput.
Note: I have not tested any of this code, since precomputing the LUTs correctly requires noticeable effort.