OK, so I don't know what this code does overall, but I do know you are asking how to optimize the ternary statements and get this part of the code working with SSE only. As a first step, I would recommend trying a flags-and-multiplication approach to avoid the conditional operator. For example:
This section:

    for (int m = 0; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m++)
    {
        bool bIsEvenFloor = vn1.m128i_u16[m] == 0;

        vnPxChroma.m128i_u16[m] = m % 2 == 0
            ? (bIsEvenFloor ? vnPxCeilChroma.m128i_u16[m] : vnPxFloorChroma.m128i_u16[m])
            : (bIsEvenFloor ? vnPxFloorChroma.m128i_u16[m] : vnPxCeilChroma.m128i_u16[m]);
    }
is functionally equivalent to this:
    // DISCLAIMER: Untested both in compilation and execution

    // Process all m % 2 == 0 in steps of 2
    for (int m = 0; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m += 2)
    {
        // This line could surely pack multiple u16s into one SSE2 register
        uint16 iIsOddFloor  = vn1.m128i_u16[m] & 0x1;   // If u16[m] == 0, result is 0
        uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;        // Flip 1 to 0, 0 to 1

        // This line could surely perform an SSE2 multiply across multiple registers
        vnPxChroma.m128i_u16[m] = iIsEvenFloor * vnPxCeilChroma.m128i_u16[m] +
                                  iIsOddFloor  * vnPxFloorChroma.m128i_u16[m];
    }

    // Process all m % 2 != 0 in steps of 2
    for (int m = 1; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m += 2)
    {
        uint16 iIsOddFloor  = vn1.m128i_u16[m] & 0x1;   // If u16[m] == 0, result is 0
        uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;        // Flip 1 to 0, 0 to 1

        vnPxChroma.m128i_u16[m] = iIsEvenFloor * vnPxFloorChroma.m128i_u16[m] +
                                  iIsOddFloor  * vnPxCeilChroma.m128i_u16[m];
    }
Basically, by splitting this into two loops you lose the performance benefit of sequential memory access, but in exchange you drop the modulo operation and two of the conditional statements per iteration.
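If losing the sequential access matters, a variation of my own (untested, reusing the question's uint16 type and MSVC-style m128i_u16 access) is to keep a single loop and fold the lane parity into the flag with one extra XOR:

    // Untested sketch: one pass, no modulo, no conditionals; the m % 2 == 0
    // test is folded into the flag via (m & 0x1).
    for (int m = 0; m < PBS_SSE_PIXELS_PROCESS_AT_ONCE; m++)
    {
        uint16 iIsOddFloor = vn1.m128i_u16[m] & 0x1;          // 1 if the lane's low bit is set
        uint16 iPickCeil   = (iIsOddFloor ^ 0x1) ^ (m & 0x1); // choice flips on odd lanes
        uint16 iPickFloor  = iPickCeil ^ 0x1;

        vnPxChroma.m128i_u16[m] = iPickCeil  * vnPxCeilChroma.m128i_u16[m] +
                                  iPickFloor * vnPxFloorChroma.m128i_u16[m];
    }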
Now, you'll notice there are two logical operations in each loop, and the multiplies I've added are not SSE intrinsic implementations. What is stored in your vn1.m128i_u16[] array? Is it only zeros and ones? If so, you don't need this part and can drop it. If not, can you normalize the data in this array to hold only zeros and ones? If the vn1.m128i_u16 array contains only ones and zeros, then this code becomes:
    uint16 iIsOddFloor  = vn1.m128i_u16[m];
    uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;   // Flip 1 to 0, 0 to 1
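If the lanes can instead hold arbitrary values, one way to normalize them to zeros and ones (an untested sketch of mine using only SSE2 intrinsics; vnIsZero and vn1Norm are made-up names) is a compare against zero followed by an AND-NOT with a register of ones:

    // Untested sketch: force every 16-bit lane of vn1 to 0 or 1.
    __m128i vnZero   = _mm_setzero_si128();
    __m128i vnOne    = _mm_set1_epi16(0x0001);
    __m128i vnIsZero = _mm_cmpeq_epi16(vn1, vnZero);      // 0xFFFF where the lane == 0, else 0
    __m128i vn1Norm  = _mm_andnot_si128(vnIsZero, vnOne); // 1 where the lane != 0, else 0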
You will also notice that I do not use SSE multiplies to perform the isEvenFloor * vnPx... part, nor SSE registers to hold iIsEvenFloor and iIsOddFloor. Sorry, I can't remember the SSE intrinsics for u16 multiplies/registers off the top of my head, but I hope this approach is still useful. Some optimizations you should be looking for:
    // This line could surely pack multiple u16s into one SSE2 register
    uint16 iIsOddFloor  = vn1.m128i_u16[m] & 0x1;   // If u16[m] == 0, result is 0
    uint16 iIsEvenFloor = iIsOddFloor ^ 0x1;        // Flip 1 to 0, 0 to 1

    // This line could surely perform an SSE2 multiply across multiple registers
    vnPxChroma.m128i_u16[m] = iIsEvenFloor * vnPxCeilChroma.m128i_u16[m] +
                              iIsOddFloor  * vnPxFloorChroma.m128i_u16[m];
This section of the code you posted, and my modification of it, still doesn't make full use of the SSE1/2/3 intrinsics, but it may give you some pointers on how that could be done (i.e. how to vectorize the code).
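To make those two comments concrete, here is an untested fragment of mine showing how the flag computation and the multiply/add could each be done eight lanes at a time with SSE2 (this covers only the m % 2 == 0 ordering; the register names are made up, and vn1 is assumed to hold only 0/1 per lane):

    // Untested sketch: whole-register version of the commented lines above,
    // for the even-index ordering only.
    __m128i vnOne       = _mm_set1_epi16(0x0001);
    __m128i vnIsOdd     = _mm_and_si128(vn1, vnOne);     // vn1[m] & 0x1 in every lane
    __m128i vnIsEven    = _mm_xor_si128(vnIsOdd, vnOne); // flip 1 <-> 0 in every lane

    __m128i vnFromCeil  = _mm_mullo_epi16(vnIsEven, vnPxCeilChroma);  // isEvenFloor * ceil
    __m128i vnFromFloor = _mm_mullo_epi16(vnIsOdd,  vnPxFloorChroma); // isOddFloor  * floor
    __m128i vnResult    = _mm_add_epi16(vnFromCeil, vnFromFloor);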
Finally, I'd say test everything. Profile your original code unchanged before making changes, then profile again after each change; the actual performance may surprise you!
Update 1:
I went through the Intel SIMD intrinsics documentation to pick out the intrinsics that might be useful for this. In particular, take a look at the bitwise XOR, AND, and MULT/ADD functions:
__m128i data type
The __m128i data type can contain sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values.
__m128i _mm_add_epi16(__m128i a, __m128i b)
Adds the 8 signed or unsigned 16-bit integers in a to the 8 signed or unsigned 16-bit integers in b.
__m128i _mm_mulhi_epu16(__m128i a, __m128i b)
Multiplies the 8 unsigned 16-bit integers in a by the 8 unsigned 16-bit integers in b. Packs the upper 16 bits of the 8 unsigned 32-bit results.
    R0 = hiword(a0 * b0)
    R1 = hiword(a1 * b1)
    R2 = hiword(a2 * b2)
    R3 = hiword(a3 * b3)
    ...
    R7 = hiword(a7 * b7)
__m128i _mm_mullo_epi16(__m128i a, __m128i b)
Multiplies the 8 signed or unsigned 16-bit integers in a by the 8 signed or unsigned 16-bit integers in b. Packs the lower 16 bits of the 8 signed or unsigned 32-bit results.
    R0 = loword(a0 * b0)
    R1 = loword(a1 * b1)
    R2 = loword(a2 * b2)
    R3 = loword(a3 * b3)
    ...
    R7 = loword(a7 * b7)
__m128i _mm_and_si128(__m128i a, __m128i b)
Computes the bitwise AND of the 128-bit value in a and the 128-bit value in b.
__m128i _mm_andnot_si128(__m128i a, __m128i b)
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.
__m128i _mm_xor_si128(__m128i a, __m128i b)
Computes the bitwise XOR of the 128-bit value in a and the 128-bit value in b.
Also, for reference, from your sample code:
    // u0 = u1 = u2 = ... = u7 = 0x0001
    __m128i vnMask = _mm_set1_epi16(0x0001);       // Sets the 8 signed 16-bit integer values.

    // For each uint16 element: vn1[i] = vnFloors[i] & 0x1
    __m128i vn1 = _mm_and_si128(vnFloors, vnMask); // Computes the bitwise AND of the 128-bit values in a and b.
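Putting those intrinsics together, here is one possible end-to-end version. This is an untested sketch of my own, not code from the question: the function name SelectChroma is invented, it assumes vn1 already holds only 0 or 1 in each 16-bit lane (as the masking above produces), and it also uses _mm_or_si128 and _mm_set_epi16, which are likewise SSE2:

    #include <emmintrin.h> // SSE2 intrinsics

    // Untested sketch: branchless replacement for the original ternary loop,
    // eight 16-bit lanes at a time.
    static inline __m128i SelectChroma(__m128i vn1,
                                       __m128i vnPxFloorChroma,
                                       __m128i vnPxCeilChroma)
    {
        const __m128i vnOne = _mm_set1_epi16(0x0001);

        // iIsOddFloor / iIsEvenFloor for all eight lanes at once
        __m128i vnIsOdd  = _mm_and_si128(vn1, vnOne);     // vn1[m] & 0x1
        __m128i vnIsEven = _mm_xor_si128(vnIsOdd, vnOne); // flip 1 <-> 0

        // Lane-parity flags replacing the m % 2 == 0 test:
        // lanes 0, 2, 4, 6 hold 1 in vnLaneEven; lanes 1, 3, 5, 7 hold 1 in vnLaneOdd.
        const __m128i vnLaneEven = _mm_set_epi16(0, 1, 0, 1, 0, 1, 0, 1);
        const __m128i vnLaneOdd  = _mm_xor_si128(vnLaneEven, vnOne);

        // pickCeil = (laneEven & isEvenFloor) | (laneOdd & isOddFloor), still 0 or 1 per lane
        __m128i vnPickCeil  = _mm_or_si128(_mm_and_si128(vnLaneEven, vnIsEven),
                                           _mm_and_si128(vnLaneOdd,  vnIsOdd));
        __m128i vnPickFloor = _mm_xor_si128(vnPickCeil, vnOne);

        // result = pickCeil * ceil + pickFloor * floor, per 16-bit lane
        return _mm_add_epi16(_mm_mullo_epi16(vnPickCeil,  vnPxCeilChroma),
                             _mm_mullo_epi16(vnPickFloor, vnPxFloorChroma));
    }

Comparing its output against the scalar loop at the top of this answer for a few test frames is the quickest way to check it really matches the original ternary logic.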