SSE intrinsics - optimizing an if/else comparison

I am trying to optimize some code that processes raw pixel data. The current C++ implementation is too slow, so I'm trying to speed it up using SSE intrinsics (SSE/SSE2/SSE3, not SSE4) with MSVC 2008. Given that this is my first time delving down to this level, I have made some progress.

Unfortunately, I have come to a particular piece of code that has me stuck:

    //Begin bad/suboptimal SSE code
    __m128i vnMask = _mm_set1_epi16(0x0001);
    __m128i vn1 = _mm_and_si128(vnFloors, vnMask);

    for(int m=0; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m++)
    {
        bool bIsEvenFloor = vn1.m128i_u16[m]==0;

        vnPxChroma.m128i_u16[m] = m%2==0 ?
            (bIsEvenFloor ? vnPxCeilChroma.m128i_u16[m] : vnPxFloorChroma.m128i_u16[m]) :
            (bIsEvenFloor ? vnPxFloorChroma.m128i_u16[m] : vnPxCeilChroma.m128i_u16[m]);
    }

For this section I'm currently falling back on element-by-element C++ code because I can't work out how it can be optimized using SSE; I find the SSE comparison intrinsics a bit difficult to get my head around.

Any suggestions / tips would be highly appreciated.

EDIT: The equivalent C++ code, processing one pixel at a time, would be:

    short pxCl=0, pxFl=0;
    short uv=0;  // chroma component of pixel
    short y=0;   // luma component of pixel

    for(int i = 0; i < end_of_line; ++i)  // end_of_line: placeholder for the line width
    {
        //Initialize pxCl and pxFl
        //...

        bool bIsEvenI     = (i%2)==0;
        bool bIsEvenFloor = (m_pnDistancesFloor[i] % 2)==0;

        uv = bIsEvenI==0 ?
            (bIsEvenFloor ? pxCl : pxFl) :
            (bIsEvenFloor ? pxFl : pxCl);

        //Merge the Y/UV of the pixel
        //...
    }

Basically, I am doing non-linear edge stretching from 4:3 to 16:9.

Tags: c++, sse, intrinsics

2 Answers

OK, so I don't know exactly what this code does, but I understand that you're asking how to optimize the ternary statements and get this part of the code working purely in SSE. As a first step, I'd recommend trying an approach based on integer flags and multiplication in order to avoid the conditional operator. For example:

This section

    for(int m=0; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m++)
    {
        bool bIsEvenFloor = vn1.m128i_u16[m]==0;

        vnPxChroma.m128i_u16[m] = m%2==0 ?
            (bIsEvenFloor ? vnPxCeilChroma.m128i_u16[m] : vnPxFloorChroma.m128i_u16[m]) :
            (bIsEvenFloor ? vnPxFloorChroma.m128i_u16[m] : vnPxCeilChroma.m128i_u16[m]);
    }

is functionally equivalent to this:

    // DISCLAIMER: Untested, both in compilation and execution

    // Process all m%2==0 in steps of 2
    for(int m=0; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m+=2)
    {
        // This line could surely pack multiple u16s into one SSE2 register
        uint16 iIsOddFloor  = vn1.m128i_u16[m] & 0x1; // If u16[m] == 0, result is 0
        uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;      // Flip 1 to 0, 0 to 1

        // This line could surely perform an SSE2 multiply across multiple registers
        vnPxChroma.m128i_u16[m] = iIsEvenFloor * vnPxCeilChroma.m128i_u16[m] +
                                  iIsOddFloor  * vnPxFloorChroma.m128i_u16[m];
    }

    // Process all m%2!=0 in steps of 2
    for(int m=1; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m+=2)
    {
        uint16 iIsOddFloor  = vn1.m128i_u16[m] & 0x1; // If u16[m] == 0, result is 0
        uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;      // Flip 1 to 0, 0 to 1

        vnPxChroma.m128i_u16[m] = iIsEvenFloor * vnPxFloorChroma.m128i_u16[m] +
                                  iIsOddFloor  * vnPxCeilChroma.m128i_u16[m];
    }

Basically, by splitting this into two loops you lose the performance gain of sequential memory access, but you drop a modulo operation and two conditional statements.
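If the loss of sequential access is a concern, the same flag-and-multiply idea can also be kept in a single loop by handling one even and one odd index per iteration. This is a sketch under the same untested disclaimer, reusing the MSVC-specific union access and the uint16 typedef from the snippets above, and assuming PBS_SSE_PIXELS_PROCESS_AT_ONCE is even:

    for(int m=0; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m+=2)
    {
        // Even index m: ceil chroma on an even floor, floor chroma on an odd floor
        uint16 iIsOddFloor0  = vn1.m128i_u16[m] & 0x1;
        uint16 iIsEvenFloor0 = iIsOddFloor0 ^ 0x1;
        vnPxChroma.m128i_u16[m] = iIsEvenFloor0 * vnPxCeilChroma.m128i_u16[m] +
                                  iIsOddFloor0  * vnPxFloorChroma.m128i_u16[m];

        // Odd index m+1: floor chroma on an even floor, ceil chroma on an odd floor
        uint16 iIsOddFloor1  = vn1.m128i_u16[m+1] & 0x1;
        uint16 iIsEvenFloor1 = iIsOddFloor1 ^ 0x1;
        vnPxChroma.m128i_u16[m+1] = iIsEvenFloor1 * vnPxFloorChroma.m128i_u16[m+1] +
                                    iIsOddFloor1  * vnPxCeilChroma.m128i_u16[m+1];
    }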

Now, you'll notice that there are still two bitwise operations in each loop, plus the multiplications, which (I should add) are not SSE intrinsic implementations here. What is stored in your vn1.m128i_u16[] array? Is it only zeros and ones? If so, you don't need the `& 0x1` part and can do away with it. If not, can you normalize the data in that array to contain only zeros and ones? If the vn1.m128i_u16 array contains only ones and zeros, this code becomes:

    uint16 iIsOddFloor  = vn1.m128i_u16[m];
    uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;  // Flip 1 to 0, 0 to 1

You'll also notice that I'm not using an SSE multiply to perform the isEvenFloor * vnPx... part, nor SSE registers to store iIsEvenFloor and iIsOddFloor. Sorry, I can't remember the SSE intrinsics for u16 multiplies/registers off the top of my head, but nevertheless I hope this approach is helpful. Some optimizations you should look at:

    // This line could surely pack multiple u16s into one SSE2 register
    uint16 iIsOddFloor  = vn1.m128i_u16[m] & 0x1; // If u16[m] == 0, result is 0
    uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;      // Flip 1 to 0, 0 to 1

    // This line could surely perform an SSE2 multiply across multiple registers
    vnPxChroma.m128i_u16[m] = iIsEvenFloor * vnPxCeilChroma.m128i_u16[m] +
                              iIsOddFloor  * vnPxFloorChroma.m128i_u16[m];

This section of the code you posted, and my modification of it, still don't make full use of the SSE1/2/3 intrinsics, but it may provide some pointers on how that could be done (how to vectorize the code).
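As a side note, one common way to vectorize exactly this kind of per-lane selection without multiplies is the SSE2 mask-and-select (branchless blend) pattern. The sketch below is untested; the helper name select_epi16 is not a real intrinsic, just a hypothetical wrapper, and the condition it masks on (floor even vs. odd) is only illustrative and ignores the m%2 interleave:

    #include <emmintrin.h>  // SSE2

    // Hypothetical helper: per-lane select between two vectors of 16-bit values.
    // 'mask' must be all-ones (0xFFFF) in lanes where 'a' should be taken and
    // all-zeros where 'b' should be taken, e.g. the output of _mm_cmpeq_epi16.
    static inline __m128i select_epi16(__m128i mask, __m128i a, __m128i b)
    {
        return _mm_or_si128(_mm_and_si128(mask, a),      // a where mask == 0xFFFF
                            _mm_andnot_si128(mask, b));  // b where mask == 0x0000
    }

    // Illustrative use: pick the ceil chroma where the floor value is even,
    // the floor chroma where it is odd.
    // __m128i vnIsEvenFloor = _mm_cmpeq_epi16(
    //     _mm_and_si128(vnFloors, _mm_set1_epi16(0x0001)), _mm_setzero_si128());
    // __m128i vnPxChroma = select_epi16(vnIsEvenFloor, vnPxCeilChroma, vnPxFloorChroma);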

Finally, I'd say test everything. Profile the code above unchanged before making the changes, then profile again afterwards. The actual performance may surprise you!


Update 1:

I've been through the Intel SIMD intrinsics documentation and picked out the intrinsics that might be useful for this. In particular, take a look at the bitwise XOR, AND, and MULT/ADD operations:

__m128i data type
The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values.

__m128i _mm_add_epi16(__m128i a, __m128i b)
Adds the 8 signed or unsigned 16-bit integers in a to the 8 signed or unsigned 16-bit integers in b.

__m128i _mm_mulhi_epu16(__m128i a, __m128i b)
Multiplies the 8 unsigned 16-bit integers from a by the 8 unsigned 16-bit integers from b. Packs the upper 16 bits of the 8 unsigned 32-bit results.

R0 = hiword (a0 * b0)
R1 = hiword (a1 * b1)
R2 = hiword (a2 * b2)
R3 = hiword (a3 * b3)
..
R7 = hiword (a7 * b7)

__m128i _mm_mullo_epi16(__m128i a, __m128i b)
Multiplies the 8 signed or unsigned 16-bit integers from a by the 8 signed or unsigned 16-bit integers from b. Packs the lower 16 bits of the 8 signed or unsigned 32-bit results.

R0 = loword (a0 * b0)
R1 = loword (a1 * b1)
R2 = loword (a2 * b2)
R3 = loword (a3 * b3)
..
R7 = loword (a7 * b7)

__m128i _mm_and_si128(__m128i a, __m128i b)
Computes the bitwise AND of the 128-bit value in a and the 128-bit value in b.

__m128i _mm_andnot_si128(__m128i a, __m128i b)
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.

__m128i _mm_xor_si128(__m128i a, __m128i b)
Computes the bitwise XOR of the 128-bit value in a and the 128-bit value in b.

Also, from your sample code, for reference:

    // Scalar equivalent: uint16 u0 = u1 = ... = u7 = 0x1
    __m128i vnMask = _mm_set1_epi16(0x0001); // Sets the 8 16-bit integer values to 0x0001.

    // Scalar equivalent: uint16 vn1[i] = vnFloors[i] & 0x1
    __m128i vn1 = _mm_and_si128(vnFloors, vnMask); // Bitwise AND of the 128-bit value in a and the 128-bit value in b.
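To connect this back to the earlier scalar sketch: the iIsOddFloor / iIsEvenFloor flags can be computed for all eight 16-bit lanes at once using only the intrinsics listed above. This is an untested sketch; the function name and output parameters are purely illustrative, and vnFloors is assumed to be the same vector as in the question:

    #include <emmintrin.h>  // SSE2

    // Untested sketch: per-lane odd/even flags (0 or 1 in each u16 lane).
    static void compute_parity_flags(__m128i vnFloors, __m128i* pvnIsOdd, __m128i* pvnIsEven)
    {
        const __m128i vnMask = _mm_set1_epi16(0x0001);
        *pvnIsOdd  = _mm_and_si128(vnFloors, vnMask);   // 0x0001 where the floor value is odd
        *pvnIsEven = _mm_xor_si128(*pvnIsOdd, vnMask);  // flip the low bit: 0x0001 where even
        // The scalar "flag * value" blend would then map onto _mm_mullo_epi16
        // (one multiply per branch of the ternary) followed by _mm_add_epi16.
    }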


Andrew, your suggestions led me to an almost optimal solution.

Using a combination of a truth table and a Karnaugh map, I found that the code

    uv = bIsEvenI==0 ?
        (bIsEvenFloor ? pxCl : pxFl) :
        (bIsEvenFloor ? pxFl : pxCl);

reduces to a NOT XOR: the floor chroma is selected when !(bIsEvenI XOR bIsEvenFloor), and the ceil chroma otherwise. From there, I was able to use SSE vector intrinsics to optimize the solution:

    //Use the mask with bitwise AND to check even/odd
    __m128i vnMask = _mm_set1_epi16(0x0001);

    //Set the bit to '1' if EVEN, else '0'
    __m128i vnFloorsEven = _mm_andnot_si128(vnFloors, vnMask);

    __m128i vnMEven = _mm_set_epi16
    (
        0, //m==7
        1,
        0,
        1,
        0,
        1,
        0, //m==1
        1  //m==0
    );

    // Bit XOR the 'floor' values and 'm'
    __m128i vnFloorsXorM = _mm_xor_si128(vnFloorsEven, vnMEven);

    // Now perform our bit NOT
    __m128i vnNotFloorsXorM = _mm_andnot_si128(vnFloorsXorM, vnMask);

    // This is the C++ ternary replacement - using multiplication
    __m128i vnA = _mm_mullo_epi16(vnNotFloorsXorM, vnPxFloorChroma);
    __m128i vnB = _mm_mullo_epi16(vnFloorsXorM, vnPxCeilChroma);

    // Set our pixels - voila!
    vnPxChroma = _mm_add_epi16(vnA, vnB);
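For completeness, the reduction itself can be brute-force checked in scalar code over all four input combinations. A small standalone sketch (the pixel values are arbitrary dummy values):

    #include <cassert>

    int main()
    {
        const short pxFl = 10, pxCl = 20;  // arbitrary, distinct dummy values

        for (int iEven = 0; iEven <= 1; ++iEven)
        {
            for (int floorEven = 0; floorEven <= 1; ++floorEven)
            {
                // Original ternary (as in the one-pixel C++ loop above)
                short original = (iEven == 0) ? (floorEven ? pxCl : pxFl)
                                              : (floorEven ? pxFl : pxCl);

                // Reduced form: ceil chroma on XOR, floor chroma on NOT XOR
                short reduced = (iEven ^ floorEven) ? pxCl : pxFl;

                assert(original == reduced);
            }
        }
        return 0;
    }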

Thanks for the help...

