What does branchless code look like in C++?

I understand that I have a big gap in my knowledge in this area (a fancy way of saying that I don't know jack about this).

Is there any documentation on how and when to use these techniques?

+8
c++ branch if-statement
3 answers

In addition to compilers optimizing code into branchless form on their own (which will not cover everything, e.g. FP), there are instructions specifically designed for producing branchless code: SETcc , FCMOVcc and CMOVcc on x86, which perform operations based on the condition flags set by a comparison.

A really simple example would be (yes, the example is so simple that nobody would probably ever write something like this, but it demonstrates the point clearly):

    bool CheckZero(int x)
    {
        if(x == 0)
            return true;
        return false;
        // or we can use: return (x == 0) ? true : false;
    }

Now a naive x86 compiler might compile it to:

    MOV EAX,[ESP + 4]
    TEST EAX,EAX
    JE _TRUE
    XOR EAX,EAX
    RETN 4
    _TRUE:
    MOV EAX,1
    RETN 4

An optimizing x86 compiler will turn it into the following branchless code (or something similar):

    MOV ECX,[ESP + 4]
    XOR EAX,EAX
    TEST ECX,ECX
    SETE AL
    RETN 4
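For comparison, the same branchless result can usually be obtained from plain C++ by returning the comparison directly; a minimal sketch (whether the compiler actually emits SETE depends on the compiler and flags, so verify the generated assembly if it matters):

```cpp
// Returning a comparison as a value gives the optimizer the best chance
// to emit TEST + SETE with no conditional jump. This is not guaranteed
// by the standard; it is just the typical outcome on x86.
bool CheckZeroBranchless(int x)
{
    return x == 0;  // no if/else or ternary needed
}
```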

A more complex example can be found here.

However, this is something the compiler will do for you, not something you should normally worry about yourself (at least not without analyzing your compiler's output). If the code is required to be branchless, C++ does not provide enough control to guarantee it, so you would need to use (inline) assembly.
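Short of dropping to assembly, a common pattern is a masked select written so that the compiler can emit CMOVcc or pure arithmetic; a sketch assuming two's-complement integers (the compiler is still free to insert a branch, so checking the output remains unavoidable):

```cpp
// Masked select: returns a when cond is true, b otherwise, with no
// conditional jump in the source. -(int)cond is all 1s when cond is
// true and 0 when false, assuming two's-complement arithmetic.
int select_branchless(bool cond, int a, int b)
{
    int mask = -(int)cond;
    return (a & mask) | (b & ~mask);
}
```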

+10

http://hbfs.wordpress.com/2008/08/05/branchless-equivalents-of-simple-functions/

I think (although I don't know more than what I read on that page) it is a way to get if-like functionality without branching (which makes sense given the title, "branchless equivalents" ;)). I don't know more than that.
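For a flavor of the kind of tricks that page discusses, here are two classic branchless bit tricks; a sketch only, assuming two's-complement integers and arithmetic right shift of negative values (which all mainstream compilers provide, though the standard leaves it implementation-defined):

```cpp
#include <cstdint>

// Branchless absolute value: mask is all 1s for negative v, else 0,
// so (v + mask) ^ mask negates v via two's complement when needed.
int32_t branchless_abs(int32_t v)
{
    int32_t mask = v >> 31;
    return (v + mask) ^ mask;
}

// Branchless minimum: -(a < b) is all 1s when a < b, selecting a;
// otherwise the mask is 0 and b is returned.
int32_t branchless_min(int32_t a, int32_t b)
{
    return b ^ ((a ^ b) & -(int32_t)(a < b));
}
```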

Thanks to Mr. Google.

+4

I wrote a ternary logic simulator not long ago, and this question was a live one for me, since it directly affects my interpreter's speed; I needed to simulate tons and tons of ternary logic gates as fast as possible.

In a binary-coded-ternary system, one trit is packed into two bits: the most significant bit means negative and the least significant means positive. The "11" case should never occur, but it must be handled properly and treated as 0.
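The packing just described can be sketched as an encoder; the name bct_encoder is my own, not from the original post:

```cpp
// Hypothetical encoder for the scheme above: trit -1/0/+1 packed into
// two bits, MSB = negative, LSB = positive.
// -1 -> 0b10, 0 -> 0b00, +1 -> 0b01.
// Itself branchless: the comparisons are used directly as 0/1 values.
unsigned bct_encoder( int trit )
{
    return (unsigned)( trit > 0 ) | ( (unsigned)( trit < 0 ) << 1 );
}
```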

Consider the function inline int bct_decoder( unsigned bctData ) , which should return our packed trit as a regular integer -1, 0 or 1. I found there are 4 approaches; I called them "cond", "mod", "math" and "lut". Let's explore them.

The first is based on jz | jnz and jl | jb conditional jumps, hence "cond". Its performance is not very good, because it relies on the branch predictor. Worse still, it varies, because it is not known a priori whether there will be one branch taken or two. Here is an example:

    inline int bct_decoder_cond( unsigned bctData )
    {
        unsigned lsB = bctData & 1;
        unsigned msB = bctData >> 1;
        return ( lsB == msB ) ? 0 :  // most frequent case -> make zero the fastest branch
               ( lsB > msB ) ? 1 : -1;
    }

This is the slowest version: in the worst case it can involve 2 branches, and this is where branch prediction suffers. On my 3770K it yields about 200 MIPS on average on random data. (Here and below, each measurement is the average of 1000 runs over a randomly filled 2 MB dataset.)
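The measurement setup just described (averaging many runs over a random buffer) can be sketched roughly like this; a hypothetical harness, not the author's actual benchmark code, with the run count and buffer size as parameters:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <vector>

// The decoder under test, as given in the answer text.
inline int bct_decoder_cond( unsigned bctData )
{
    unsigned lsB = bctData & 1;
    unsigned msB = bctData >> 1;
    return ( lsB == msB ) ? 0 : ( lsB > msB ) ? 1 : -1;
}

// Decodes a random buffer `runs` times and returns millions of
// decodes per second (the "MIPS" figure used in the answer).
double measure_mips( int (*decoder)( unsigned ), int runs, std::size_t bytes )
{
    std::vector<unsigned char> data( bytes );
    for ( auto &b : data )
        b = (unsigned char)( std::rand() & 0x3 );  // random 2-bit trits

    volatile int sink = 0;  // keeps the loop from being optimized away
    auto t0 = std::chrono::steady_clock::now();
    for ( int run = 0; run < runs; ++run )
        for ( unsigned char b : data )
            sink = sink + decoder( b );
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>( t1 - t0 ).count();
    return (double)runs * (double)data.size() / seconds / 1e6;
}
```

With small parameters this just sanity-checks the plumbing; reproducing the answer's numbers would need the full 1000 x 2 MB workload and optimized builds.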

The second relies on the modulo operator; its speed is somewhere between the first and the third, but it is definitely faster than "cond", at 600 MIPS:

    inline int bct_decoder_mod( unsigned bctData )
    {
        return ( int )( ( bctData + 1 ) % 3 ) - 1;
    }

The next is the branchless approach that involves only arithmetic, hence "math"; it does not involve jumps at all:

    inline int bct_decoder_math( unsigned bctData )
    {
        return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
    }

It does what it should and behaves really well. For comparison, its performance is 1000 MIPS, 5x faster than the branchy version. The branchy version is probably also slowed down by the lack of native 2-bit signed int support. But in my application it is a pretty good version in its own right.

If that is still not enough, we can go further with something special. The next approach is the lookup table, hence "lut":

    inline int bct_decoder_lut( unsigned bctData )
    {
        static const int decoderLUT[] = { 0, 1, -1, 0 };
        return decoderLUT[ bctData & 0x3 ];
    }

In my case one trit occupies only 2 bits, so the LUT has just 4 entries and was worth a try. It fits in cache and runs fast at 1400-1600 MIPS (here my measurement accuracy drops). That is a 1.5x speedup over the fast "math" approach: the result is precomputed, and all that remains at runtime is one AND instruction and a load. Unfortunately caches are small, so if your index is wider than a few bits you simply cannot use this technique.
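One property worth noting: all four variants agree on every possible 2-bit input, including the "forbidden" 11 pattern, which each of them maps to 0 ((3 + 1) % 3 - 1 = 0 for "mod", 1 - 1 = 0 for "math"). A sketch gathering the four decoders from the answer for a cross-check:

```cpp
// The four decoder variants from the answer, collected so they can be
// checked against each other on all four possible 2-bit inputs.
inline int bct_decoder_cond( unsigned bctData )
{
    unsigned lsB = bctData & 1;
    unsigned msB = bctData >> 1;
    return ( lsB == msB ) ? 0 : ( lsB > msB ) ? 1 : -1;
}

inline int bct_decoder_mod( unsigned bctData )
{
    return ( int )( ( bctData + 1 ) % 3 ) - 1;
}

inline int bct_decoder_math( unsigned bctData )
{
    return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}

inline int bct_decoder_lut( unsigned bctData )
{
    static const int decoderLUT[] = { 0, 1, -1, 0 };
    return decoderLUT[ bctData & 0x3 ];
}
```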

So I think I have answered your question about what branchy / branchless code can look like, with detailed examples, a real application, and real performance measurements.

+1
