You can vectorize the comparison with _mm_cmplt_epi8 and _mm_cmpgt_epi8 (SSE2 intrinsics, available in MSVC).
Then you can take the movemask of the ANDed comparison results. If the movemask result is 0xFFFF, all 16 bytes passed. Otherwise you need to run a scalar tail loop to find the exact position that failed the test. You could also derive that position from the mask itself, but depending on the value of "len" this may not be worth the effort.
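If you do want to get the exact index from the mask, a minimal sketch could use a bit scan on the inverted mask (the helper name is mine, and it assumes, as in the code further down, that bit k of the mask is set when byte k passed):

    #include <intrin.h>

    /* Hypothetical helper: index (0..15) of the first byte that failed,
       given a movemask where bit k == 1 means byte k passed.
       Call only when mask != 0xFFFF. */
    static int first_failed_index(int mask)
    {
        unsigned long idx;
        _BitScanForward(&idx, (unsigned long)(~mask & 0xFFFF));
        return (int)idx;
    }

The failing position is then block_index * 16 + first_failed_index(mask), which skips the scalar re-scan.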
An ordinary unvectorized tail loop is also needed if len is not a multiple of 16. This may or may not end up faster - you will need to measure it to be sure.
Correction: the byte comparisons operate on signed values, so the above doesn't work as is.
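As an aside, one common way to salvage the comparison approach (not what the code below does) is to flip the sign bit of every byte with XOR 0x80 first, so that unsigned order matches signed order. A rough sketch for the range 32..127 (printable_mask is a made-up helper, not part of the code below):

    #include <emmintrin.h>

    /* Sketch only: bit k of the returned mask is set iff byte k of v
       lies in [32, 127]. The compare constants are pre-flipped by 0x80 too. */
    static int printable_mask(__m128i v)
    {
        __m128i s     = _mm_xor_si128(v, _mm_set1_epi8((char)0x80));
        __m128i ge32  = _mm_cmpgt_epi8(s, _mm_set1_epi8((char)(31 ^ 0x80)));  /* byte >= 32  */
        __m128i le127 = _mm_cmplt_epi8(s, _mm_set1_epi8((char)(128 ^ 0x80))); /* byte <= 127 */
        return _mm_movemask_epi8(_mm_and_si128(ge32, le127));
    }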
The working version is below.
    #include <emmintrin.h>   /* SSE2 intrinsics */

    union UmmU8 {
        __m128i       mm_;
        unsigned char u8_[16];
    };

    int f(int len, unsigned char *p)
    {
        int i = 0;
        __m128i A, B, C;
        UmmU8 *pu = (UmmU8 *)p;
        int const len16 = len / 16;

        while (i < len16) {
            /* Unaligned load of the next 16 bytes (_mm_loadu_si128 avoids
               requiring p to be 16-byte aligned). */
            A = _mm_loadu_si128(&pu[i].mm_);
            /* A byte is in [32, 127] iff bit 7 is clear and bit 6 or bit 5 is set.
               The 32-bit shifts move bits 6 and 5 of each byte up into that byte's
               bit 7; the cross-byte spill never reaches a bit-7 position. */
            B = _mm_slli_epi32(A, 1);
            C = _mm_slli_epi32(A, 2);
            B = _mm_or_si128(B, C);           /* bit 7 = bit6 | bit5           */
            A = _mm_andnot_si128(A, B);       /* bit 7 = ~bit7 & (bit6 | bit5) */
            int mask = _mm_movemask_epi8(A);  /* one bit per byte that passed  */
            if (mask == 0xFFFF) {
                ++i;                          /* whole block is in range       */
            } else {
                if (mask == 0) {
                    return i * 16;            /* whole block failed            */
                }
                break;                        /* mixed block: locate it below  */
            }
        }
        i *= 16;
        /* Scalar tail: the mixed block and/or the remaining len % 16 bytes. */
        while (i < len && p[i] >= 32 && p[i] <= 127) {
            i++;
        }
        return i;
    }
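A quick sanity check (the test buffer is made up for illustration): the first byte outside 32..127 sits at index 20, so f should return 20:

    #include <stdio.h>

    int main(void)
    {
        /* 20 printable bytes, then 0x01, then some more data */
        unsigned char buf[] = "Hello, SSE2 world!!!\x01trailing";
        printf("%d\n", f((int)(sizeof(buf) - 1), buf));   /* prints 20 */
        return 0;
    }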
Since I don't have a 64-bit OS on this PC, I can't do a proper performance test. However, a profiling run gave:
- naive loop: 30.44
- 64-bit integer: 15.22 (on a 32-bit OS)
- SSE impl: 5.21
Thus, the SSE version is much faster than the naive loop. I would expect the 64-bit version to perform much better on a 64-bit system; the difference between the SSE and 64-bit versions may then be insignificant.