Try
total += (-((i & mask) != 0)) & value[j];
instead
total += ((i & mask) != 0) * value[j];
This avoids multiplication. Whether there will be a branch or not depends on whether the compiler is sufficiently equipped to find free code for - (foo! = 0). (This is possible, but I would be a little surprised.)
(Of course, it depends on the view with two additions, the C standard is agnostic on this.)
You could help with the compiler, so assuming a 32-bit int and that a signed โ extends the sign bit:
total += (((int)((i & mask) << (31 - j))) >> 31) & value[j];
That is, shift the bit of the possible set from the left to the most significant position, starting it as a signed int, then return all the paths back to the least significant position, getting either all 0 or all 1 in accordance with the above implementation - certain assumptions. (I have not tested this.)
Another possibility: consider blocks (say) 4 bits at a time. There are 16 different addition sequences; You can send detailed code for each of them, without any tests in each block of code. The hope here is that one indirect jump cost less than 4 test and branches.
Update: Using the forests of Jonathan Leffler, the 4-bit time-based method is fastest with the greatest scope on my MacBook. Deny - and it turns out to be about the same as multiply. I wonder if the processor multiplies special cases such as 0 and 1 faster (or not such a special case if it is faster for most bit-bits or most bit-sets in general).
I did not encode the accepted answer, since it would hardly be the fastest in this particular test (it should get most of its benefit from listing only the given bits, doing best on sparse sets, but completely half the bits set in this test). Here are my changes to Leffler's code if someone else is strangely motivated to spend time on this:
#include <stdio.h> #include <time.h> static int value[] = { 12, 36, 79, 21, 31, 93, 24, 15, 56, 63, 20, 47, 62, 88, 9, 36, }; static int test_1(int i) { int total = 0; for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++) { if (i & mask) total += value[j]; } return(total); } static int test_2(int i) { int total = 0; for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++) { total += ((i & mask) != 0) * value[j]; } return(total); } static int test_3(int i) { int total = 0; for (unsigned mask = i & 0xFFFF, j = 0; mask != 0; mask >>= 1, j++) { total += (mask & 0x0001) * value[j]; } return(total); } static int test_4(int i) { int total = 0; for (unsigned mask = i & 0xFFFF, j = 0; mask != 0; mask >>= 1, j++) { total += -(mask & 0x0001) & value[j]; } return(total); } static int test_5(int i) { int total = 0; const int *p = value; for (unsigned mask = i & 0xFFFF; mask != 0; mask >>= 4, p += 4) { switch (mask & 0xF) { case 0x0: break; case 0x1: total += p[0]; break; case 0x2: total += p[1]; break; case 0x3: total += p[1] + p[0]; break; case 0x4: total += p[2]; break; case 0x5: total += p[2] + p[0]; break; case 0x6: total += p[2] + p[1]; break; case 0x7: total += p[2] + p[1] + p[0]; break; case 0x8: total += p[3]; break; case 0x9: total += p[3] + p[0]; break; case 0xA: total += p[3] + p[1]; break; case 0xB: total += p[3] + p[1] + p[0]; break; case 0xC: total += p[3] + p[2]; break; case 0xD: total += p[3] + p[2] + p[0]; break; case 0xE: total += p[3] + p[2] + p[1]; break; case 0xF: total += p[3] + p[2] + p[1] + p[0]; break; } } return(total); } typedef int(*func_pointer)(int); static func_pointer test[] = { test_1, test_2, test_3, test_4, test_5 }; #define DIM(x)(sizeof(x)/sizeof(*(x))) int main() { int i, j, k; for (i = 0; i < DIM(test); i++) { long sum = 0; clock_t start = clock(); for (j = 0; j <= 0xFFFF; j += 13) { int rv; for (k = 0; k < 1000; k++) rv = (*test[i])(j); sum += rv; } clock_t stop = clock(); printf("(sum = %ld) Test %d: %8.6fs\n", sum, i + 1, (stop - start) / (1.0 * CLOCKS_PER_SEC)); } }
Results ( gcc -O4 -std=c99 branchmult2.c ):
(sum = 1744366) Test 1: 0.225497 s (sum = 1744366) Test 2: 0.221127 s (sum = 1744366) Test 3: 0.126301 s (sum = 1744366) Test 4: 0.124750 s (sum = 1744366) Test 5: 0.064877 s
Edit 2: I decided that the test would be more realistic without the volatile qualifier.