How to optimize the cycle?

I have the following bottleneck function.

typedef unsigned char byte; void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3) { const byte b1 = 128-30; const byte b2 = 128+30; for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) { *p3 = (*p1 < *p2 ) ? b1 : b2; } } 

I want to replace C++ with SSE2 built-in functions. I tried _mm_cmpgt_epi8 , but it used a signed comparison. I need an unsigned comparison.

Is there any trick (SSE, SSE2, SSSE3) to solve my problem?

Note: In this case, I do not want to use multithreading.

+6
c ++ optimization assembly intrinsics sse2
source share
6 answers

Instead of compensating for your signed values ​​to make them unsigned, a slightly more efficient way would be to do the following:

  • use _mm_min_epu8 to get the unsigned minimum p1, p2
  • compare this minimum for equality with p2 using _mm_cmpeq_epi8
  • the resulting mask will now be 0x00 for elements, where p1 <p2 and 0xff for elements, where p1> = p2
  • you can now use this mask with _mm_or_si128 and _mm_andc_si128 to select the appropriate b1 / b2 values

Please note that these are only 4 commands, compared to 5, using the + offset sign comparison method.

+9
source share

You can subtract 127 from your numbers and then use _mm_cmpgt_epi8

+2
source share

Yes, this can be done in SIMD, but a few steps will be required to create a mask.

I think Ruslik understood this correctly. You want each component with 0x80 to display the meaning of a signed and unsigned comparison. _mm_xor_si128 ( PXOR ) gets this for you - you will need to create a mask as a static char array somewhere before loading it into the SIMD register. Then _mm_cmpgt_epi8 gets you a mask, and you can use bitwise AND (for example, _mm_and_si128 ) to perform a masked move.
+2
source share

Yes, SSE does not work here. You can improve the performance of this code on a multi-core computer using OpenMP:

  void CompareArrays (const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
 {
      const byte b1 = 128-30;
      const byte b2 = 128 + 30;

      int n = p1End - p1Start;
      #pragma omp parallel for
      for (int i = 0; i <n; ++ p1, ++ i) 
      {
         p3 [i] = (p1 [i] <p2 [i])?  b1: b2;
      }
 }
+1
source share

Unfortunately, many of the answers above are incorrect. Let a 3-bit word be assumed:

unsigned: 4 5 6 7 0 1 2 3 == signed: -4 -3 -2 -1 0 1 2 3 (bit: 100 101 110 111 000 001 010 011)

Paul R's method is incorrect. Suppose we want to know if 3> 2. min (3,2) == 2, which says yes, so the method works here. Now suppose we want to know if 7> 2. The value 7 is -1 in the signed view, therefore min (-1,2) == -1, which mistakenly assumes that 7 does not exceed 2 unsigned.

Andrey’s method is also incorrect. Suppose we want to know that 7> 2, or = 7, and b = 2. The value 7 is -1 in the signed view, so the first member (a> b) fails, and the method assumes that 7 is no more than 2.

However, the BJobnh method corrected by Alexei is correct. Just subtract 2 ^ (n-1) from the values, where n is the number of bits. In this case, we would subtract 4 to get the new corresponding values:

old signed: -4 -3 -2 -1 0 1 2 3 => new signed: 0 1 2 3 -4 -3 -2 -1 == new unsigned 0 1 2 3 4 5 6 7.

In other words, unsigned_greater_than (a, b) is equivalent to sign_greater_than (a - 2 ^ (n-1), b - 2 ^ (n-1)).

-one
source share

use pcmpeqb and be Strength with you.

-3
source share

All Articles