Is branching or multiplication more efficient?

I am trying to optimize a small, heavily used function that uses the high bits in an unsigned short int to indicate which values of an array should be summed. At first I used the obvious approach, shown below. Note that loop unrolling is not shown explicitly, since that should be done by the compiler.

 int total = 0;
 for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++) {
     if (i & mask) {
         total += value[j];
     }
 }

However, I later thought it might be better to remove the branching to help the processor's pipelining, and came up with the following.

 int total = 0;
 for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++) {
     total += ((i & mask) != 0) * value[j];
 }

Note that since (i & mask) does not yield a boolean, comparing it with 0 forces the result to be either 1 or 0. Although this second approach eliminates the if-statement from this section of code, it adds a multiplication by 0 or 1 on each iteration, on top of the rest of the expression.

Which code will run faster?

+6
c++ optimization c branch if-statement
12 answers

You can do this without multiplication as well. It looks like for each set bit you are using that bit's position as an index into an array.

First, you can easily extract the lowest set bit with:

 unsigned short set_mask = i & -i;
 i &= i - 1;

Then you can get the bit index by counting the bits set in (set_mask - 1). There is a constant-time formula for this.

Some platforms also have a built-in instruction for finding the index of a set bit, which is probably faster. x86 has bsr, PPC has cntlz.

So the branch-free, multiplication-free version is possibly the fastest :)

+7

Which code will run faster?

Test it and find out.

Also, look at the assembly language the compiler emits; you may see something there that surprises you, or that hints at further optimization (for example, using short may require more instructions than using the machine's natural integer size).

+13

Either one could be faster. For some processors, the actual input data can change the answer. You will need to profile both approaches with real data. Here are some things that can affect actual performance on x86 hardware.

Suppose for the moment that you are using a late-model Pentium 4. That processor has two levels of branch predictors baked into the CPU. If the branch predictors can correctly guess the direction of the branch, I suspect the first version will be fastest. That is most likely to happen if the flags are almost all the same, or if they alternate in a very simple pattern most of the time. If the flags are truly random, the branch predictor will be wrong half the time. For our hypothetical 32-stage Pentium 4, that will kill performance. On Pentium 3, Core 2 and Core i7 chips, and most AMD chips, the pipelines are shorter, so the cost of a bad branch prediction is much lower.

If your vector of values is noticeably larger than the processor cache, then either method will be limited by memory bandwidth, and both will perform essentially identically. If the vector of values fits comfortably into the cache, be careful how you do the profiling, so that one of the test loops is not penalized for filling the cache while the other benefits from it.

+9

How about this version?

 int total = 0;
 for (unsigned mask = i & 0xFFFF, j = 0; mask != 0; mask >>= 1, j++) {
     total += (mask & 0x0001) * value[j];
 }

I made mask a copy of i limited to the 16-bit unsigned range, and the code checks whether the last bit of the mask is set by multiplying the array value by that bit. This should be faster because there are fewer operations per iteration, and only the main loop branch and condition are needed. In addition, the loop can exit early if i is small to begin with.


This shows why measurement is important. I am using an antique Sun SPARC. I wrote a test program, as shown, with the two contenders from the question as test 0 and test 1 and my own answer as test 2, and then timed the tests. The sum is printed as a sanity check, to show that all the algorithms give the same answer.

64-bit version:

 gcc -m64 -std=c99 -I$HOME/inc -o x x.c -L$HOME/lib/sparcv9 -ljl -lposix4
 Test 0: (sum = 1744366)  7.973411 us
 Test 1: (sum = 1744366) 10.269095 us
 Test 2: (sum = 1744366)  7.475852 us

Nice: mine is a fraction faster than the original, and the multiplying version is slower.

Optimized 64-bit:

 gcc -O4 -m64 -std=c99 -I$HOME/inc -o x x.c -L$HOME/lib/sparcv9 -ljl -lposix4
 Test 0: (sum = 1744366) 1.101703 us
 Test 1: (sum = 1744366) 1.915972 us
 Test 2: (sum = 1744366) 2.575318 us

Darn - my version is now the slowest. The optimizer is good!

Optimized 32-bit:

 gcc -O4 -std=c99 -I$HOME/inc -o x x.c -L$HOME/lib -ljl -lposix4
 Test 0: (sum = 1744366) 0.839278 us
 Test 1: (sum = 1744366) 1.905009 us
 Test 2: (sum = 1744366) 2.448998 us

32 bit version:

 gcc -std=c99 -I$HOME/inc -o x x.c -L$HOME/lib -ljl -lposix4
 Test 0: (sum = 1744366) 7.493672 us
 Test 1: (sum = 1744366) 9.610240 us
 Test 2: (sum = 1744366) 6.838929 us

The same code (32-bit) under Cygwin on a not very geriatric laptop (32-bit, optimized):

 Test 0: (sum = 1744366) 0.557000 us
 Test 1: (sum = 1744366) 0.553000 us
 Test 2: (sum = 1744366) 0.403000 us

Now my code is the fastest. That is why you measure! It also shows why people who do benchmarking for a living go crazy.

Test harness (shout if you want the timer.h and timer.c code):

 #include <stdio.h>
 #include "timer.h"

 static volatile int value[] =
 {
     12, 36, 79, 21, 31, 93, 24, 15,
     56, 63, 20, 47, 62, 88,  9, 36,
 };

 static int test_1(int i)
 {
     int total = 0;
     for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++)
     {
         if (i & mask)
             total += value[j];
     }
     return(total);
 }

 static int test_2(int i)
 {
     int total = 0;
     for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++)
     {
         total += ((i & mask) != 0) * value[j];
     }
     return(total);
 }

 static int test_3(int i)
 {
     int total = 0;
     for (unsigned mask = i & 0xFFFF, j = 0; mask != 0; mask >>= 1, j++)
     {
         total += (mask & 0x0001) * value[j];
     }
     return(total);
 }

 typedef int (*func_pointer)(int);

 static func_pointer test[] = { test_1, test_2, test_3 };

 #define DIM(x) (sizeof(x)/sizeof(*(x)))

 int main()
 {
     int i, j, k;
     char buffer[32];
     for (i = 0; i < DIM(test); i++)
     {
         Clock t;
         long sum = 0;
         clk_init(&t);
         clk_start(&t);
         for (j = 0; j < 0xFFFF; j += 13)
         {
             int rv;
             for (k = 0; k < 1000; k++)
                 rv = (*test[i])(j);
             sum += rv;
         }
         clk_stop(&t);
         printf("Test %d: (sum = %ld) %9s us\n", i, sum,
                clk_elapsed_us(&t, buffer, sizeof(buffer)));
     }
 }

I have not spent time working out why my code is slower when optimized.

+4

It depends entirely on the compiler, the set of machine instructions, and possibly the phase of the moon.

Because of this, there is no single correct answer. If you really want to know, look at the assembly your compiler generates.

From a simplistic point of view, I would say the second is slower, since it involves all the computation of the first plus a multiplication. But the compiler may well be smart enough to optimize that away.

So the correct answer is: it depends.

+2

Although the second example has no explicit branch, there may be an implicit one in converting the result of the comparison to a bool. You can get some idea by enabling assembly listing output in your compiler and looking at what it emits.

Of course, the only way to know for sure is to take some timings both ways.

+1

The answer has to be: try it on the target hardware and see. And be sure to follow the advice in the many micro-benchmark/stopwatch questions posted here on SO over the past few weeks.

Link to one benchmarking question: Are stopwatch benchmarks acceptable?

Personally, I would go with the if unless there were a really compelling reason to use the "obfuscated" alternative.

+1

The only real way to settle the question is to test. With that in mind, I agree with the earlier posts that say: try it!

Branching is a costly operation on most modern processors. When a branch is mispredicted, the pipeline must be flushed, and the CPU cannot execute several instructions at once, simply because it does not know where the next instruction will come from. With more than a couple of branches, the possible control flows become too complex for the CPU to try them all simultaneously, so it must pick one branch and speculate down it.

+1

Why not do this (assuming i is 32 bits)?

 for (i2 = i; i2; i2 = i3) {
     i3 = i2 & (i2 - 1);   /* clear the lowest set bit */
     last_bit = i2 - i3;   /* isolate the lowest set bit */
     a = last_bit & 0xffff;
     b = last_bit >> 16;   /* high half, shifted down */
     j = place[a] + big_place[b];
     total += value[j];
 }

Here place is a table of size 2^15 + 1 such that place[0] = 0, place[1] = 1, place[2] = 2, place[4] = 3, place[8] = 4, ... place[0x8000] = 16 (the other entries don't matter), and big_place is almost the same: big_place[0] = 0, big_place[1] = 17, ... big_place[0x8000] = 32.

+1

Try

 total += (-((i & mask) != 0)) & value[j]; 

instead of

 total += ((i & mask) != 0) * value[j]; 

This avoids multiplication. Whether there will be a branch or not depends on whether the compiler is sufficiently equipped to find free code for - (foo! = 0). (This is possible, but I would be a little surprised.)

(Of course, this relies on two's-complement representation; the C standard is agnostic about that.)

You could help the compiler out, though. Assuming a 32-bit int and that >> on a signed value propagates the sign bit:

 total += (((int)((i & mask) << (31 - j))) >> 31) & value[j]; 

That is, shift the possibly-set bit up to the most significant position, reinterpret it as a signed int, then shift it all the way back down to the least significant position, yielding either all 0s or all 1s under the implementation-defined assumptions above. (I have not tested this.)

Another possibility: process blocks of (say) 4 bits at a time. There are 16 different addition sequences; you can dispatch to straight-line code for each of them, with no tests inside each code block. The hope here is that one indirect jump costs less than four test-and-branches.

Update: using Jonathan Leffler's harness, the 4-bit-at-a-time method is fastest by a wide margin on my MacBook. Negate-and-AND turns out to be about the same as the multiply. I wonder if the processor special-cases multiplication by values such as 0 and 1 to make it faster (or whether it is not such a special case, if multiplies are faster for mostly-clear or mostly-set operands in general).

I did not code up the accepted answer, since it is unlikely to be fastest in this particular benchmark (it should get most of its benefit from visiting only the set bits, doing best on sparse sets, but fully half the bits are set in this test). Here are my changes to Leffler's code, if anyone else is strangely motivated to spend time on this:

 #include <stdio.h>
 #include <time.h>

 static int value[] =
 {
     12, 36, 79, 21, 31, 93, 24, 15,
     56, 63, 20, 47, 62, 88,  9, 36,
 };

 static int test_1(int i)
 {
     int total = 0;
     for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++)
     {
         if (i & mask)
             total += value[j];
     }
     return(total);
 }

 static int test_2(int i)
 {
     int total = 0;
     for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++)
     {
         total += ((i & mask) != 0) * value[j];
     }
     return(total);
 }

 static int test_3(int i)
 {
     int total = 0;
     for (unsigned mask = i & 0xFFFF, j = 0; mask != 0; mask >>= 1, j++)
     {
         total += (mask & 0x0001) * value[j];
     }
     return(total);
 }

 static int test_4(int i)
 {
     int total = 0;
     for (unsigned mask = i & 0xFFFF, j = 0; mask != 0; mask >>= 1, j++)
     {
         total += -(mask & 0x0001) & value[j];
     }
     return(total);
 }

 static int test_5(int i)
 {
     int total = 0;
     const int *p = value;
     for (unsigned mask = i & 0xFFFF; mask != 0; mask >>= 4, p += 4)
     {
         switch (mask & 0xF)
         {
         case 0x0: break;
         case 0x1: total += p[0]; break;
         case 0x2: total += p[1]; break;
         case 0x3: total += p[1] + p[0]; break;
         case 0x4: total += p[2]; break;
         case 0x5: total += p[2] + p[0]; break;
         case 0x6: total += p[2] + p[1]; break;
         case 0x7: total += p[2] + p[1] + p[0]; break;
         case 0x8: total += p[3]; break;
         case 0x9: total += p[3] + p[0]; break;
         case 0xA: total += p[3] + p[1]; break;
         case 0xB: total += p[3] + p[1] + p[0]; break;
         case 0xC: total += p[3] + p[2]; break;
         case 0xD: total += p[3] + p[2] + p[0]; break;
         case 0xE: total += p[3] + p[2] + p[1]; break;
         case 0xF: total += p[3] + p[2] + p[1] + p[0]; break;
         }
     }
     return(total);
 }

 typedef int (*func_pointer)(int);

 static func_pointer test[] = { test_1, test_2, test_3, test_4, test_5 };

 #define DIM(x) (sizeof(x)/sizeof(*(x)))

 int main()
 {
     int i, j, k;
     for (i = 0; i < DIM(test); i++)
     {
         long sum = 0;
         clock_t start = clock();
         for (j = 0; j <= 0xFFFF; j += 13)
         {
             int rv;
             for (k = 0; k < 1000; k++)
                 rv = (*test[i])(j);
             sum += rv;
         }
         clock_t stop = clock();
         printf("(sum = %ld) Test %d: %8.6fs\n", sum, i + 1,
                (stop - start) / (1.0 * CLOCKS_PER_SEC));
     }
 }

Results ( gcc -O4 -std=c99 branchmult2.c ):

 (sum = 1744366) Test 1: 0.225497s
 (sum = 1744366) Test 2: 0.221127s
 (sum = 1744366) Test 3: 0.126301s
 (sum = 1744366) Test 4: 0.124750s
 (sum = 1744366) Test 5: 0.064877s

Edit 2: I decided that the test would be more realistic without the volatile qualifier.

+1

To be really sure, you can avoid the loop, the shifts and the multiplications altogether - use a switch.

 switch (i) {
 case 0: break;
 case 1: total = value[0]; break;
 case 2: total = value[1]; break;
 case 3: total = value[1] + value[0]; break;
 case 4: total = value[2]; break;
 case 5: total = value[2] + value[0]; break;
 ...
 }

It is a lot of code, but I think it will be much faster at runtime. You cannot beat the performance of a lookup table!

I would rather write a small Perl script to generate this code for me - if only to avoid typing errors.

If you think this is a little extreme, you can use a smaller table - for 4 bits - and look it up several times, shifting the mask each time. Performance will suffer a little, but the code will be much smaller.

+1

The obvious solution:

 int total = 0;
 for (unsigned j = 0; j < 16; j++) {
     total += -(i >> j & 1) & value[j];
 }
+1
