Just using plain int for a , b and c will allow you to change the code to a simpler
if (a >= b - c && a <= b + c) ...
In addition, as an alternative, 256 * 256 * 256 is only 16 M, and a card of 16 M bits is 2 MB. This means that you can use a lookup table, for example
int index = (a<<16) + (b<<8) + c; if (lookup_table[index>>3] & (1<<(index&7))) ...
but I think a cache failure will make it a lot slower, even if modern processors hate conditional expressions ...
Another alternative is using bit algebra
b - c <= a <= b + c iff - c <= a - b <= c (subtracted b from all terms) iff 0 <= a - b + c <= 2*c (added c to all terms)
this allows you to use only one test
if ((unsigned)(a - b + c) < 2*c) ...
assuming a , b and c are equal to int s. The reason is that if a - b + c negative, then unsigned arithmetic will make it much larger than 2*c (if c is 0..255). This should generate efficient single-branch machine code if the processor has special signature / unsigned comparison commands, such as x86 (ja / jg).
#include <stdio.h> int main() { int err = 0; for (int ia=0; ia<256; ia++) for (int ib=0; ib<256; ib++) for (int ic=0; ic<256; ic++) { unsigned char a = ia; unsigned char b = ib; unsigned char c = ic; int res1 = (a >= ( b - c > 0 ? b - c : 0 ) && a <= ( b + c < 255 ? b + c : 255 )); int res2 = (unsigned(a - b + c) <= 2*c); err += (res1 != res2); } printf("Errors = %i\n", err); return 0; }
On x86 with g ++, the assembler code generated for the res2 test includes only one conditional instruction.
Assembler code for the following loop:
void process(unsigned char *src, unsigned char *dst, int sz) { for (int i=0; i<sz; i+=3) { unsigned char a = src[i]; unsigned char b = src[i+1]; unsigned char c = src[i+2]; dst[i] = (unsigned(a - b + c) <= 2*c); } } .L3: movzbl 2(%ecx,%eax), %ebx ; This loads c movzbl (%ecx,%eax), %edx ; This loads a movzbl 1(%ecx,%eax), %esi ; This loads b leal (%ebx,%edx), %edx ; This computes a + c addl %ebx, %ebx ; This is c * 2 subl %esi, %edx ; This is a - b + c cmpl %ebx, %edx ; Comparison setbe (%edi,%eax) ; Set 0/1 depending on result addl $3, %eax ; next group cmpl %eax, 16(%ebp) ; did we finish ? jg .L3 ; if not loop back for next
Using instead dst[i] = (a<b ? ba : ab); the code gets much longer
.L9: movzbl %dl, %edx andl $255, %esi subl %esi, %edx .L4: andl $255, %edi cmpl %edi, %edx movl 12(%ebp), %edx setle (%edx,%eax) addl $3, %eax cmpl %eax, 16(%ebp) jle .L6 .L5: movzbl (%ecx,%eax), %edx movb %dl, -13(%ebp) movzbl 1(%ecx,%eax), %esi movzbl 2(%ecx,%eax), %edi movl %esi, %ebx cmpb %bl, %dl ja .L9 movl %esi, %ebx movzbl %bl, %edx movzbl -13(%ebp), %ebx subl %ebx, %edx jmp .L4 .p2align 4,,7 .p2align 3 .L6:
And I'm too tired to try to decrypt it (2:28 here)
In any case, longer doesnโt mean that it needs to be slower (at first glance it seems that g ++ decided to expand the loop by writing several elements at a time in this case).
As I said, you must do the actual profiling with your real calculations and your real data. Please note that if true performance is required, it may be that the best strategy will vary depending on the processor.
For example, Linux does an ae test at boot time to determine which is a faster way to perform the specific computation that is needed in the kernel. Variables are too large (cache size / levels, rod rotation speed, processor clock, chipset, processor type ...).