Optimization, branch elimination

Question

Optimization, branch elimination

float mixValue = ... //in range -1.0f to 1.0f for(... ; ... ; ... ) //long loop { float inputLevel = ... //in range -1.0f to 1.0f if(inputLevel < 0.0 && mixValue < 0.0) { mixValue = (mixValue + inputLevel) + (mixValue*inputLevel); } else { mixValue = (mixValue + inputLevel) - (mixValue*inputLevel); } }

just a question, can we calculate mixValue without branching ? or any other optimization suggestion, such as using SIMD?

edit: just for more information, I ended up using this solution based on the selected answer:

 const float sign[] = {-1, 1}; float mixValue = ... //in range -1.0f to 1.0f for(... ; ... ; ... ) //long loop { float inputLevel = ... //in range -1.0f to 1.0f unsigned a = *(unsigned*)(&mixValue); unsigned b = *(unsigned*)(&inputLevel); float mulValue = mixValue * inputLevel * sign[(a & b) >> (8*sizeof(unsigned)-1)]; float addValue = mixValue + inputLevel; mixValue = addValue + mulValue; }

thanks.

+4

c ++ optimization floating-point branch

uray Aug 2 '10 at 12:46

source share

8 answers

How about this:

 const float sign[] = {-1, 1}; float mixValue = ... //in range -1.0f to 1.0f for(... ; ... ; ... ) //long loop { float inputLevel = ... //in range -1.0f to 1.0f int bothNegative = (inputLevel < 0.0) & (mixValue < 0.0); mixValue = (mixValue + inputLevel) + (sign[bothNegative]*mixValue*inputLevel); }

Edit: Mike was right that && would introduce the branch and thanks to Pedro for his proof. I changed && and now GCC (version 4.4.0) generates code without branches.

+4

user362638 Aug 2 '10 at 13:21

source share

 float mixValue = ... //in range -1.0f to 1.0f for(... ; ... ; ... ) //long loop { float inputLevel = ... //in range -1.0f to 1.0f float mulValue = mixValue * inputLevel; float addValue = mixValue + inputLevel; __int32 a = *(__int32*)(&mixValue); __int32 b = *(__int32*)(&inputLevel); __int32 c = *(__int32*)(&mulValue); __int32 d = c & ((a ^ b) | 0x7FFFFFFF); mixValue = addValue + *(float*)(&d); }

+1

uray Aug 2 '10 at 13:58

source share

Have you compared the loop with and without a branch?

At the very least, you can remove one part of the branch, since mixValue is out of loop.

 float multiplier(float a, float b){ unsigned char c1Neg = reinterpret_cast<unsigned char *>(&a)[3] & 0x80; unsigned char c2Neg = reinterpret_cast<unsigned char *>(&b)[3] & 0x80; unsigned char multiplierIsNeg = c1Neg & c2Neg; float one = 1; reinterpret_cast<unsigned char *>(&one)[3] |= multiplierIsNeg; return -one; } cout << multiplier(-1,-1) << endl; // +1 cout << multiplier( 1,-1) << endl; // -1 cout << multiplier( 1, 1) << endl; // -1 cout << multiplier(-1, 1) << endl; // -1

0

tibur Aug 2 '10 at 12:56

source share

If you are concerned about excessive branching, check out Duff Device . This should help to relax a bit. In truth, the unwinding cycle is what the optimizer will do, so trying to do it manually can be a waste of time. Check assembly output to find out.

SIMD will definitely help if you perform the same operation for each element in your array. Keep in mind that not all hardware supports SIMD, but some compilers, such as gcc, provide built-in functions for SIMD, which saves you from diving into assembler.

If you use gcc to compile ARM code, the built-in SIMD functions can be found here.

0

doron Aug 2 '10 at 13:08

source share

On top of my head (I'm sure it can be reduced):

mixValue = (mixValue + inputLevel) + (((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1) / fabs(((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1)))*-1*(mixValue*inputLevel);

To clarify a bit, I will calculate the sign separately:

 float sign = (((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1) / fabs(((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1)))*-1; mixValue = (mixValue + inputLevel) + sign*(mixValue*inputLevel);

This is floating point math, so you probably have to fix some rounding problems, but that should put you on the right track, I think.

0

Anthony diSanti Aug 2 '10 at 13:13

source share

mixValue looking at your code, you will see that you will always add the absolute value of mixValue and inputLevel , unless both are positive.

With some knowledge of bit fiddling and IEEE floating point, you can get rid of the conditional:

 // sets the first bit of f to zero => makes it positive. void absf( float& f ) { assert( sizeof( float ) == sizeof( int ) ); reinterpret_cast<int&>( f ) &= ~0x80000000; } // returns a first-bit = 1 if f is positive int pos( float& f ) { return ~(reinterpret_cast<int&>(f) & 0x80000000) & 0x80000000; } // returns -fabs( f*g ) if f>0 and g>0, fabs(f*g) otherwise. float prod( float& f, float& g ) { float p = f*g; float& rp=p; int& ri = reinterpret_cast<int&>(rp); absf(p); ri |= ( pos(f) & pos(g) & 0x80000000); // first bit = + & + return p; } int main(){ struct T { float f, g, r; void test() { float p = prod(f,g); float d = (pr)/r; assert( -1e-15 < d && d < 1e-15 ); } }; T vals[] = { {1,1,-1},{1,-1,1},{-1,1,1},{-1,-1,1} }; for( T* val=vals; val != vals+4; ++val ) { val->test(); } }

And finally: your cycle

 for( ... ) { mixedResult += inputLevel + prod(mixedResult,inputLevel); }

Note. The size of your accumulation does not match. inputLevel is a dimensionless quantity, and mixedResult is your ... result (for example, in Pascal, in Volts, ...). You cannot add two sizes with different sizes. You probably want mixedResult += prod( mixedResult, inputLevel ) as your battery.

0

xtofl Aug 2 '10 at 22:08

source share

Some compilers (i.e. MSC) also require manual character validation.

A source:

 volatile float mixValue; volatile float inputLevel; float u = mixValue*inputLevel; float v = -u; float a[] = { v, u }; mixValue = (mixValue + inputLevel) + a[ (inputLevel<0.0) & (mixValue<0.0) ];

IntelC 11.1:

 movss xmm1, DWORD PTR [12+esp] mulss xmm1, DWORD PTR [16+esp] movss xmm6, DWORD PTR [12+esp] movss xmm2, DWORD PTR [16+esp] movss xmm3, DWORD PTR [16+esp] movss xmm5, DWORD PTR [12+esp] xorps xmm4, xmm4 movaps xmm0, xmm4 subss xmm0, xmm1 movss DWORD PTR [esp], xmm0 movss DWORD PTR [4+esp], xmm1 addss xmm6, xmm2 xor eax, eax cmpltss xmm3, xmm4 movd ecx, xmm3 neg ecx cmpltss xmm5, xmm4 movd edx, xmm5 neg edx and ecx, edx addss xmm6, DWORD PTR [esp+ecx*4] movss DWORD PTR [12+esp], xmm6

gcc 4.5:

 flds 32(%esp) flds 16(%esp) fmulp %st, %st(1) fld %st(0) fchs fstps (%esp) fstps 4(%esp) flds 32(%esp) flds 16(%esp) flds 16(%esp) flds 32(%esp) fxch %st(2) faddp %st, %st(3) fldz fcomi %st(2), %st fstp %st(2) fxch %st(1) seta %dl xorl %eax, %eax fcomip %st(1), %st fstp %st(0) seta %al andl %edx, %eax fadds (%esp,%eax,4) xorl %eax, %eax fstps 32(%esp)

0

Shelwien Aug 2 '10 at 22:55

source share

Pedro d'quino · Accepted Answer · 2010-08-02T14:10:54+0000

Inspired by Rock's answer (which is on MSVC ++ 10 branches), this does not look like a branch:

 #include <iostream> using namespace std; const float sign[] = {-1, 1}; int main() { const int N = 10; float mixValue = -0.5F; for(int i = 0; i < N; i++) { volatile float inputLevel = -0.3F; int bothNegative = ((((unsigned char*)&inputLevel)[3] & 0x80) & (((unsigned char*)&mixValue)[3] & 0x80)) >> 7; mixValue = (mixValue + inputLevel) + (sign[bothNegative]*mixValue*inputLevel); } std::cout << mixValue << std::endl; }

Here is the disassembly analyzed using IDA Pro (compiled in MSVC ++ 10, Release mode):

Dismantling http://img248.imageshack.us/img248/6865/floattestbranchmine.png

Optimization, branch elimination

More articles: