Optimization, branch elimination

float mixValue = ... //in range -1.0f to 1.0f for(... ; ... ; ... ) //long loop { float inputLevel = ... //in range -1.0f to 1.0f if(inputLevel < 0.0 && mixValue < 0.0) { mixValue = (mixValue + inputLevel) + (mixValue*inputLevel); } else { mixValue = (mixValue + inputLevel) - (mixValue*inputLevel); } } 

just a question, can we calculate mixValue without branching ? or any other optimization suggestion, such as using SIMD?

edit: just for more information, I ended up using this solution based on the selected answer:

 const float sign[] = {-1, 1}; float mixValue = ... //in range -1.0f to 1.0f for(... ; ... ; ... ) //long loop { float inputLevel = ... //in range -1.0f to 1.0f unsigned a = *(unsigned*)(&mixValue); unsigned b = *(unsigned*)(&inputLevel); float mulValue = mixValue * inputLevel * sign[(a & b) >> (8*sizeof(unsigned)-1)]; float addValue = mixValue + inputLevel; mixValue = addValue + mulValue; } 

thanks.

+4
source share
8 answers

Inspired by Rock's answer (which is on MSVC ++ 10 branches), this does not look like a branch:

 #include <iostream> using namespace std; const float sign[] = {-1, 1}; int main() { const int N = 10; float mixValue = -0.5F; for(int i = 0; i < N; i++) { volatile float inputLevel = -0.3F; int bothNegative = ((((unsigned char*)&inputLevel)[3] & 0x80) & (((unsigned char*)&mixValue)[3] & 0x80)) >> 7; mixValue = (mixValue + inputLevel) + (sign[bothNegative]*mixValue*inputLevel); } std::cout << mixValue << std::endl; } 

Here is the disassembly analyzed using IDA Pro (compiled in MSVC ++ 10, Release mode):

Dismantling http://img248.imageshack.us/img248/6865/floattestbranchmine.png

+1
source

How about this:

 const float sign[] = {-1, 1}; float mixValue = ... //in range -1.0f to 1.0f for(... ; ... ; ... ) //long loop { float inputLevel = ... //in range -1.0f to 1.0f int bothNegative = (inputLevel < 0.0) & (mixValue < 0.0); mixValue = (mixValue + inputLevel) + (sign[bothNegative]*mixValue*inputLevel); } 

Edit: Mike was right that && would introduce the branch and thanks to Pedro for his proof. I changed && and now GCC (version 4.4.0) generates code without branches.

+4
source
 float mixValue = ... //in range -1.0f to 1.0f for(... ; ... ; ... ) //long loop { float inputLevel = ... //in range -1.0f to 1.0f float mulValue = mixValue * inputLevel; float addValue = mixValue + inputLevel; __int32 a = *(__int32*)(&mixValue); __int32 b = *(__int32*)(&inputLevel); __int32 c = *(__int32*)(&mulValue); __int32 d = c & ((a ^ b) | 0x7FFFFFFF); mixValue = addValue + *(float*)(&d); } 
+1
source

Have you compared the loop with and without a branch?

At the very least, you can remove one part of the branch, since mixValue is out of loop.

 float multiplier(float a, float b){ unsigned char c1Neg = reinterpret_cast<unsigned char *>(&a)[3] & 0x80; unsigned char c2Neg = reinterpret_cast<unsigned char *>(&b)[3] & 0x80; unsigned char multiplierIsNeg = c1Neg & c2Neg; float one = 1; reinterpret_cast<unsigned char *>(&one)[3] |= multiplierIsNeg; return -one; } cout << multiplier(-1,-1) << endl; // +1 cout << multiplier( 1,-1) << endl; // -1 cout << multiplier( 1, 1) << endl; // -1 cout << multiplier(-1, 1) << endl; // -1 
0
source

If you are concerned about excessive branching, check out Duff Device . This should help to relax a bit. In truth, the unwinding cycle is what the optimizer will do, so trying to do it manually can be a waste of time. Check assembly output to find out.

SIMD will definitely help if you perform the same operation for each element in your array. Keep in mind that not all hardware supports SIMD, but some compilers, such as gcc, provide built-in functions for SIMD, which saves you from diving into assembler.

If you use gcc to compile ARM code, the built-in SIMD functions can be found here.

0
source

On top of my head (I'm sure it can be reduced):

mixValue = (mixValue + inputLevel) + (((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1) / fabs(((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1)))*-1*(mixValue*inputLevel);

To clarify a bit, I will calculate the sign separately:

 float sign = (((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1) / fabs(((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1)))*-1; mixValue = (mixValue + inputLevel) + sign*(mixValue*inputLevel); 

This is floating point math, so you probably have to fix some rounding problems, but that should put you on the right track, I think.

0
source

mixValue looking at your code, you will see that you will always add the absolute value of mixValue and inputLevel , unless both are positive.

With some knowledge of bit fiddling and IEEE floating point, you can get rid of the conditional:

 // sets the first bit of f to zero => makes it positive. void absf( float& f ) { assert( sizeof( float ) == sizeof( int ) ); reinterpret_cast<int&>( f ) &= ~0x80000000; } // returns a first-bit = 1 if f is positive int pos( float& f ) { return ~(reinterpret_cast<int&>(f) & 0x80000000) & 0x80000000; } // returns -fabs( f*g ) if f>0 and g>0, fabs(f*g) otherwise. float prod( float& f, float& g ) { float p = f*g; float& rp=p; int& ri = reinterpret_cast<int&>(rp); absf(p); ri |= ( pos(f) & pos(g) & 0x80000000); // first bit = + & + return p; } int main(){ struct T { float f, g, r; void test() { float p = prod(f,g); float d = (pr)/r; assert( -1e-15 < d && d < 1e-15 ); } }; T vals[] = { {1,1,-1},{1,-1,1},{-1,1,1},{-1,-1,1} }; for( T* val=vals; val != vals+4; ++val ) { val->test(); } } 

And finally: your cycle

 for( ... ) { mixedResult += inputLevel + prod(mixedResult,inputLevel); } 

Note. The size of your accumulation does not match. inputLevel is a dimensionless quantity, and mixedResult is your ... result (for example, in Pascal, in Volts, ...). You cannot add two sizes with different sizes. You probably want mixedResult += prod( mixedResult, inputLevel ) as your battery.

0
source

Some compilers (i.e. MSC) also require manual character validation.

A source:

 volatile float mixValue; volatile float inputLevel; float u = mixValue*inputLevel; float v = -u; float a[] = { v, u }; mixValue = (mixValue + inputLevel) + a[ (inputLevel<0.0) & (mixValue<0.0) ]; 

IntelC 11.1:

 movss xmm1, DWORD PTR [12+esp] mulss xmm1, DWORD PTR [16+esp] movss xmm6, DWORD PTR [12+esp] movss xmm2, DWORD PTR [16+esp] movss xmm3, DWORD PTR [16+esp] movss xmm5, DWORD PTR [12+esp] xorps xmm4, xmm4 movaps xmm0, xmm4 subss xmm0, xmm1 movss DWORD PTR [esp], xmm0 movss DWORD PTR [4+esp], xmm1 addss xmm6, xmm2 xor eax, eax cmpltss xmm3, xmm4 movd ecx, xmm3 neg ecx cmpltss xmm5, xmm4 movd edx, xmm5 neg edx and ecx, edx addss xmm6, DWORD PTR [esp+ecx*4] movss DWORD PTR [12+esp], xmm6 

gcc 4.5:

 flds 32(%esp) flds 16(%esp) fmulp %st, %st(1) fld %st(0) fchs fstps (%esp) fstps 4(%esp) flds 32(%esp) flds 16(%esp) flds 16(%esp) flds 32(%esp) fxch %st(2) faddp %st, %st(3) fldz fcomi %st(2), %st fstp %st(2) fxch %st(1) seta %dl xorl %eax, %eax fcomip %st(1), %st fstp %st(0) seta %al andl %edx, %eax fadds (%esp,%eax,4) xorl %eax, %eax fstps 32(%esp) 
0
source

All Articles