Is this function a good candidate for SIMD on Intel?

I am trying to optimize the following function (a bit simplified, but this is a loop in which my program spends a lot of time):

    int f(int len, unsigned char *p) {
        int i = 0;
        while (i < len && p[i] >= 32 && p[i] <= 127) {
            i++;
        }
        return i;
    }

I think it could be optimized with vector instructions, but from a small amount of research it seems that SSE is not designed to work at the byte level. This program only targets 64-bit Intel processors running OS X. Is there a clever bit trick I'm not seeing that would let me work on 64 bits at a time? Clang/LLVM at -O3 does not apply any smart optimizations here.

UPDATE:

The SIMD code is usually the fastest in my testing (depending on the size of the input), but for some reason the application as a whole runs more slowly with SIMD than with the naive code or the bit-twiddling trick. For context, the application finds the lengths of ASCII substrings in the input stream to a terminal emulator; ASCII strings get special fast-path processing. I could only mark one answer as correct, but both were great. I made a slight improvement to the bit-twiddling version by removing the if statement, as follows:

    while (i < len - 8) {
        uint64_t bytes = *(uint64_t *)(p + i);
        uint64_t middleBits = bytes & 0x6060606060606060;  // bits 5 and 6 of each byte
        uint64_t highBits = bytes & 0x8080808080808080;    // bit 7 of each byte
        middleBits |= (middleBits >> 1);  // bit 5 <- (bit 5 | bit 6)
        middleBits &= ~(highBits >> 2);   // clear bit 5 wherever bit 7 was set
        // Every byte is in 32..127 iff bit 5 survived in every byte.
        if ((middleBits & 0x2020202020202020) != 0x2020202020202020) {
            break;
        }
        i += 8;
    }
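The loop above only covers whole 8-byte words; as in the original function, a scalar tail is still needed afterwards for the remaining bytes and to pin down the exact failing position, along these lines:

    // Scalar tail after the 8-byte loop, same range test as the original.
    while (i < len && p[i] >= 32 && p[i] <= 127) {
        i++;
    }
    return i;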
3 answers

You can vectorize the comparison with _mm_cmplt_epi8 and _mm_cmpgt_epi8 (SSE2 intrinsics).

Then you can use _mm_movemask_epi8 on the result of ANDing the two comparison results together. If the movemask result is 0xFFFF, all 16 comparisons passed; otherwise you need a scalar tail loop to find the exact position that failed the test. You can also work that position out from the mask itself, but depending on the typical value of len this may not be worth the effort.

A scalar tail loop is also required if len is not a multiple of 16. All of this may or may not be faster - you will need to benchmark to be sure.
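As an illustration, digging the failing position out of the mask might look like this (a minimal sketch; mask is assumed to be the 16-bit _mm_movemask_epi8 result, with a set bit meaning the byte passed):

    // Sketch only: the first zero bit of 'mask' marks the first failing byte.
    if (mask != 0xFFFF) {
        int firstFail = __builtin_ctz(~mask & 0xFFFF);  // 0..15 within the block
        return i * 16 + firstFail;
    }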

Gah - the epi8 comparisons operate on signed values, so the straightforward version doesn't work.

The working version is below.

    #include <emmintrin.h>

    union UmmU8 {
        __m128i mm_;
        unsigned char u8_[16];
    };

    int f(int len, unsigned char *p) {
        int i = 0;
        __m128i A, B, C;
        UmmU8 *pu = (UmmU8 *)p;   // assumes p is suitably aligned for 16-byte loads
        int const len16 = len / 16;
        while (i < len16) {
            A = pu[i].mm_;
            B = _mm_slli_epi32(A, 1);    // bit 7 of each byte <- old bit 6
            C = _mm_slli_epi32(A, 2);    // bit 7 of each byte <- old bit 5
            B = _mm_or_si128(B, C);      // bit 7 <- (bit 5 | bit 6)
            A = _mm_andnot_si128(A, B);  // bit 7 <- ~bit 7 & (bit 5 | bit 6)
            int mask = _mm_movemask_epi8(A);
            if (mask == 0xFFFF) {
                ++i;                     // all 16 bytes are in 32..127
            } else {
                if (mask == 0) {
                    return i * 16;       // first byte of this block already fails
                }
                break;                   // partial block: find position in the tail
            }
        }
        i *= 16;
        while (i < len && p[i] >= 32 && p[i] <= 127) {
            i++;
        }
        return i;
    }

Since I don't have a 64-bit OS on this PC, I can't run a proper perf test. However, a profiling run gave:

  • naive loop: 30.44
  • 64-bit integer: 15.22 (on a 32-bit OS)
  • SSE impl: 5.21

So the SSE version is much faster than the naive loop. I would expect the 64-bit integer version to fare much better on a 64-bit system; the difference between the SSE and 64-bit versions might then be insignificant.


I'm not sure if this answers your question, or whether it will speed up your code significantly, but here is an idea. Since 32 is 2^5, a byte in the range 32..127 must have bit 5 or bit 6 set and bit 7 clear (numbering bits from 0). You can extend this test to 64-bit integers, giving code like:

    // Check whether every byte of x is in the range 32..127.
    unsigned bytesInRange(unsigned long long x) {
        unsigned long long y, z;
        if ((x & 0x8080808080808080LL) != 0)  // some byte has bit 7 set (>= 128)
            return 0;
        y = x >> 1;  // bit 6 of each byte moves down to bit 5
        z = x | y;   // bit 5 of z = (bit 5 | bit 6) of x
        if ((z & 0x2020202020202020LL) == 0x2020202020202020LL)
            return 1;  // every byte has bit 5 or bit 6 set
        return 0;
    }

    int f(int len, unsigned char *p) {
        int i = 0;
        int len8 = len / 8;
        unsigned long long *q = (unsigned long long *)p;
        while (i < len8 && bytesInRange(q[i])) {
            i++;
        }
        i = i * 8;
        // Scalar tail: remaining bytes, plus the word that failed, if any.
        while (i < len && p[i] >= 32 && p[i] <= 127) {
            i++;
        }
        return i;
    }

On architectures where alignment matters, it would have to be checked before the first loop.
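For example, a hypothetical peel loop at the top of f (uintptr_t comes from <stdint.h>; this fragment is a sketch, not part of the original answer) could consume bytes until the pointer is 8-byte aligned:

    // Hypothetical alignment peel: apply the scalar range test byte-by-byte
    // until p + i is 8-byte aligned, so the word loop can use aligned loads.
    while (i < len && ((uintptr_t)(p + i) & 7) != 0) {
        if (p[i] < 32 || p[i] > 127)
            return i;  // first failing byte found during the peel
        i++;
    }
    // ...then run the 64-bit word loop from offset i, followed by the scalar tail.

The word-loop bound would then have to be computed from the remaining length, (len - i) / 8, rather than from len directly.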


I tried several approaches to the problem, based on SSE2 and on SSE4.2. The string operations from SSE4.2 are rather slow; the SSE2 versions outperform them easily. Note that, overall, the best solution depends heavily on the average length of the expected answer.
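For reference, the SSE4.2 string approach presumably looks roughly like this (a sketch with assumed names such as firstNonAscii16, not the author's exact code):

    #include <nmmintrin.h>  // SSE4.2

    // Sketch: return the index of the first byte of the 16-byte block at p
    // (of which n bytes are valid) that falls outside 32..127, or 16 if none.
    int firstNonAscii16(const unsigned char *p, int n) {
        const __m128i range = _mm_setr_epi8(32, 127, 0, 0, 0, 0, 0, 0,
                                            0, 0, 0, 0, 0, 0, 0, 0);
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        // _SIDD_CMP_RANGES matches bytes inside [32, 127]; negative polarity
        // makes PCMPESTRI report the first byte that does NOT match.
        return _mm_cmpestri(range, 2, chunk, n,
                            _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES |
                            _SIDD_NEGATIVE_POLARITY);
    }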

Below is one of the best solutions, for answer <= 400:

    //SSE2 vectorization by stgatilov: no unrolling, fast BSF tail
    int CommonAsciiLength_sse2_end(int len, unsigned char *p) {
        const __m128i *ptr = (const __m128i *)p;
        int blocks = len >> 4;
        int cnt;
        for (cnt = 0; cnt < blocks; cnt++) {
            // Signed compare: bytes >= 128 are negative, so they also test
            // as "< 32"; a single comparison handles both range bounds.
            __m128i mask = _mm_cmplt_epi8(ptr[cnt], _mm_set1_epi8(32));
            int val = _mm_movemask_epi8(mask);
            if (val)
                return 16 * cnt + __builtin_ctz(val);
        }
        // Final partial block: this reads up to 16 bytes past the end of
        // the buffer; the bytes beyond len are masked off below.
        __m128i mask = _mm_cmplt_epi8(ptr[cnt], _mm_set1_epi8(32));
        int val = _mm_movemask_epi8(mask);
        val |= -(1 << (len - 16 * cnt));   // force a "failure" at position len
        return 16 * cnt + __builtin_ctz(val);
    }

Note that for larger answers, this solution also benefits from unrolling.
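The unrolled variants in the timings below presumably follow this shape (an assumed sketch, not the author's exact sse2_x4 code): OR several block masks together, and only locate the exact byte once a group fails:

    // Hypothetical 4x-unrolled inner loop: test 64 bytes per iteration and
    // fall back to the single-block loop to pinpoint the failing byte.
    for (; cnt + 4 <= blocks; cnt += 4) {
        __m128i m0 = _mm_cmplt_epi8(ptr[cnt + 0], _mm_set1_epi8(32));
        __m128i m1 = _mm_cmplt_epi8(ptr[cnt + 1], _mm_set1_epi8(32));
        __m128i m2 = _mm_cmplt_epi8(ptr[cnt + 2], _mm_set1_epi8(32));
        __m128i m3 = _mm_cmplt_epi8(ptr[cnt + 3], _mm_set1_epi8(32));
        __m128i any = _mm_or_si128(_mm_or_si128(m0, m1), _mm_or_si128(m2, m3));
        if (_mm_movemask_epi8(any))
            break;  // some byte in these 64 failed; locate it block by block
    }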

Here are some timings for the different solutions and different answer lengths, measured on Ivy Bridge. Note that it only makes sense to compare timings within a single run; comparing across runs with different average answer lengths is not valid.

    All checked. Average answer = 7.0
    Time = 4.879 (1884680192)  original
    Time = 6.021 (1884680192)  bitmask
    Time = 5.205 (1884680192)  Pete
    Time = 5.094 (1884680192)  sse2
    Time = 5.301 (1884680192)  sse2_x4
    Time = 1.603 (1884680192)  sse42
    Time = 1.235 (1884680192)  sse2_end
    Time = 2.319 (1884680192)  sse2_x4_end
    =========================================
    All checked. Average answer = 47.0
    Time = 5.825 (-1867343006)  original
    Time = 4.792 (-1867343006)  bitmask
    Time = 4.490 (-1867343006)  Pete
    Time = 4.327 (-1867343006)  sse2
    Time = 5.260 (-1867343006)  sse2_x4
    Time = 3.347 (-1867343006)  sse42
    Time = 2.505 (-1867343006)  sse2_end
    Time = 3.008 (-1867343006)  sse2_x4_end
    =========================================
    All checked. Average answer = 151.4
    Time = 4.372 (-2086294174)  original
    Time = 2.150 (-2086294174)  bitmask
    Time = 1.662 (-2086294174)  Pete
    Time = 1.492 (-2086294174)  sse2
    Time = 2.249 (-2086294174)  sse2_x4
    Time = 1.649 (-2086294174)  sse42
    Time = 0.986 (-2086294174)  sse2_end
    Time = 1.398 (-2086294174)  sse2_x4_end
    =========================================
    All checked. Average answer = 426.8
    Time = 3.772 (1814680269)  original
    Time = 1.320 (1814680269)  bitmask
    Time = 0.830 (1814680269)  Pete
    Time = 0.692 (1814680269)  sse2
    Time = 0.870 (1814680269)  sse2_x4
    Time = 1.186 (1814680269)  sse42
    Time = 0.531 (1814680269)  sse2_end
    Time = 0.573 (1814680269)  sse2_x4_end
    =========================================
    All checked. Average answer = 1083.4
    Time = 2.788 (358018991)  original
    Time = 0.819 (358018991)  bitmask
    Time = 0.443 (358018991)  Pete
    Time = 0.344 (358018991)  sse2
    Time = 0.347 (358018991)  sse2_x4
    Time = 0.813 (358018991)  sse42
    Time = 0.297 (358018991)  sse2_end
    Time = 0.256 (358018991)  sse2_x4_end

The full code of all the solutions, along with the test harness, is available here.

