If AVX2 is acceptable (with earlier versions it didn’t work out so well, but you can still do something there), you can search in many places at once. I could not test this on my machine (only compile), so the following gives you an idea of how you can approach it than copy & paste code, so I will try to explain this, and not just dump the code.
The main idea is to read uint64_t , shift it to the right by all the values that make sense (from 0 to 7), then for each of these 8 new uint64_t , check to see if there is a byte in it. A slight complication: for uint64_t shifted by more than 0, the top position should not be taken into account, since it has zeros shifted into it, which may not be in the actual data. Once this is done, the next uint64_t should be read at offset 7 from the current one, otherwise there will be a border that is not checked. These subtle, albeit low loads are not so bad, especially if they are small.
So, now for some (unverified and incomplete, see below) code,
__m256i needle = _mm256_set1_epi8(find); size_t i; for (i = 0; i < n - 6; i += 7) { // unaligned load here, but that OK uint64_t d = *(uint64_t*)(data + i); __m256i x = _mm256_set1_epi64x(d); __m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0)); __m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4)); low = _mm256_cmpeq_epi8(low, needle); high = _mm256_cmpeq_epi8(high, needle); // in the qword right-shifted by 0, all positions are valid // otherwise, the top position corresponds to an incomplete byte uint32_t lowmask = 0x7f7f7fffu & _mm256_movemask_epi8(low); uint32_t highmask = 0x7f7f7f7fu & _mm256_movemask_epi8(high); uint64_t mask = lowmask | ((uint64_t)highmask << 32); if (mask) { int bitindex = __builtin_ffsl(mask); // the bit-index and byte-index are swapped return 8 * (i + (bitindex & 7)) + (bitindex >> 3); } }
The funny “bit-index and byte-index change” - this is because the search in qword is performed by byte by bytes, and the results of these comparisons end with 8 adjacent bits, while the search is “shifted by 1”, ends in the next 8 bits and so on . Thus, in the resulting masks, the index of the byte that contains 1 is a bit offset, but the bit index in this byte is actually a byte offset, for example, 0x8000 will correspond to a byte search in the 7th byte qword shifted to the right by 1, therefore, the actual index is 8 * 7 + 1.
There is also a tail problem, part of the data remaining after processing all 7-byte blocks. This can be done the same way, but now more positions contain false bytes. Now there are n - i bytes left, so the low byte should have the n - i bit, and for all other bytes - less. For the same reason as the previous ones, zeros are shifted in other positions. In addition, if there is only 1 byte “left”, this is not entirely true, because it would have already been tested, but it does not really matter. I assume that the data is sufficiently augmented that accessing outside does not matter. Here it is not verified:
if (i < n - 1) { // make ni-1 bits, then copy them to every byte uint32_t validh = ((1u << (n - i - 1)) - 1) * 0x01010101; // the lowest position has an extra valid bit, set lowest zero uint32_t validl = (validh + 1) | validh; uint64_t d = *(uint64_t*)(data + i); __m256i x = _mm256_set1_epi64x(d); __m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0)); __m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4)); low = _mm256_cmpeq_epi8(low, needle); high = _mm256_cmpeq_epi8(high, needle); uint32_t lowmask = validl & _mm256_movemask_epi8(low); uint32_t highmask = validh & _mm256_movemask_epi8(high); uint64_t mask = lowmask | ((uint64_t)highmask << 32); if (mask) { int bitindex = __builtin_ffsl(mask); return 8 * (i + (bitindex & 7)) + (bitindex >> 3); } }