What is the best way (performance-wise) to check whether a value falls within a threshold?

Is this the fastest way to run the test:

if( a >= ( b - c > 0 ? b - c : 0 ) && a <= ( b + c < 255 ? b + c : 255 ) ) ... 

where a, b and c are all unsigned char (aka BYTE). I'm trying to optimize an image-scanning process that finds a sub-image, and this comparison runs about 3 million times per scan, so even small optimizations can be useful.

Not sure, but maybe some bitwise operation? Maybe adding 1 to c and testing for less-than and greater-than, with or without the equals part? I don't know!

+1
4 answers

Well, first of all, let's look at what you are trying to check, ignoring the over/underflow clamping:

 a >= b - c
 a <= b + c

Subtract b from both sides:

 a - b >= -c
 a - b <= c

This is equivalent to

 abs(a - b) <= c 

And in the code:

 (a > b ? a - b : b - a) <= c 

This code is also a bit shorter and needs no complicated overflow/underflow checks.


I profiled my code and 6502's with 1,000,000,000 repetitions, and there was officially no difference. I would suggest picking the most elegant solution (IMO mine, but opinions differ), since performance is not an argument here.


However, there was a noticeable difference between my code and the asker's. This is the profiling code I used:

 #include <iostream>

 int main(int argc, char *argv[]) {
     bool prevent_opti = false;
     for (int ai = 0; ai < 256; ++ai) {
         for (int bi = 0; bi < 256; ++bi) {
             for (int ci = 0; ci < 256; ++ci) {
                 unsigned char a = ai;
                 unsigned char b = bi;
                 unsigned char c = ci;
                 if ((a > b ? a - b : b - a) <= c) prevent_opti = true;
             }
         }
     }
     std::cout << prevent_opti << "\n";
     return 0;
 }

With the if statement this averaged 120 ms; the ternary-operator version averaged 135 ms.

+4

Just using plain int for a, b and c will allow you to change the code to the simpler

 if (a >= b - c && a <= b + c) ... 

As an alternative, 256 * 256 * 256 is only 16M, and a map of 16M bits is 2 MB. That means you could use a lookup table, for example

 int index = (a<<16) + (b<<8) + c;
 if (lookup_table[index>>3] & (1<<(index&7))) ...

but I suspect cache misses would make this a lot slower, even though modern processors hate conditionals...

Another alternative is using a bit of algebra:

 b - c <= a <= b + c
 iff     -c <= a - b <= c          (subtracted b from all terms)
 iff      0 <= a - b + c <= 2*c    (added c to all terms)

which allows using just one test:

 if ((unsigned)(a - b + c) <= 2*c) ... 

assuming a, b and c are ints. The reason it works is that if a - b + c is negative, unsigned arithmetic will make it much larger than 2*c (given c is 0..255). This should generate efficient single-branch machine code if the processor has separate signed/unsigned comparison instructions, as x86 does (ja / jg). Checking all the combinations confirms the formula:

 #include <stdio.h>

 int main() {
     int err = 0;
     for (int ia = 0; ia < 256; ia++)
         for (int ib = 0; ib < 256; ib++)
             for (int ic = 0; ic < 256; ic++) {
                 unsigned char a = ia;
                 unsigned char b = ib;
                 unsigned char c = ic;
                 int res1 = (a >= (b - c > 0 ? b - c : 0) &&
                             a <= (b + c < 255 ? b + c : 255));
                 int res2 = (unsigned(a - b + c) <= 2*c);
                 err += (res1 != res2);
             }
     printf("Errors = %i\n", err);
     return 0;
 }

On x86 with g++, the assembly generated for the res2 test contains only one conditional instruction.

Assembler code for the following loop:

 void process(unsigned char *src, unsigned char *dst, int sz) {
     for (int i = 0; i < sz; i += 3) {
         unsigned char a = src[i];
         unsigned char b = src[i+1];
         unsigned char c = src[i+2];
         dst[i] = (unsigned(a - b + c) <= 2*c);
     }
 }

is

 .L3:
     movzbl 2(%ecx,%eax), %ebx   ; This loads c
     movzbl (%ecx,%eax), %edx    ; This loads a
     movzbl 1(%ecx,%eax), %esi   ; This loads b
     leal   (%ebx,%edx), %edx    ; This computes a + c
     addl   %ebx, %ebx           ; This is c * 2
     subl   %esi, %edx           ; This is a - b + c
     cmpl   %ebx, %edx           ; Comparison
     setbe  (%edi,%eax)          ; Set 0/1 depending on result
     addl   $3, %eax             ; Next group
     cmpl   %eax, 16(%ebp)       ; Did we finish?
     jg     .L3                  ; If not, loop back for next

Using dst[i] = (a < b ? b - a : a - b); instead, the code gets much longer:

 .L9:
     movzbl %dl, %edx
     andl   $255, %esi
     subl   %esi, %edx
 .L4:
     andl   $255, %edi
     cmpl   %edi, %edx
     movl   12(%ebp), %edx
     setle  (%edx,%eax)
     addl   $3, %eax
     cmpl   %eax, 16(%ebp)
     jle    .L6
 .L5:
     movzbl (%ecx,%eax), %edx
     movb   %dl, -13(%ebp)
     movzbl 1(%ecx,%eax), %esi
     movzbl 2(%ecx,%eax), %edi
     movl   %esi, %ebx
     cmpb   %bl, %dl
     ja     .L9
     movl   %esi, %ebx
     movzbl %bl, %edx
     movzbl -13(%ebp), %ebx
     subl   %ebx, %edx
     jmp    .L4
     .p2align 4,,7
     .p2align 3
 .L6:

And I'm too tired to try to decipher it (it's 2:28 AM here).

In any case, longer doesn't necessarily mean slower (at a first glance it seems g++ decided to unroll the loop here, writing several elements at a time).

As I said, you must do the actual profiling with your real computation and your real data. Note that if top performance really matters, the best strategy may vary depending on the processor.

For example, Linux runs a test at boot time to determine which is the faster way to perform a specific computation needed in the kernel. There are simply too many variables (cache size/levels, RAM speed/timings, CPU clock, chipset, CPU type...).

+2

My belief is that you will get maximum performance by writing the code as clearly as possible and then turning on the compiler's optimizer. The compiler is pretty good at this sort of optimization and will beat you most of the time (at worst it will match you).

My preference would be:

 int min = (b - c) > 0   ? (b - c) : 0;
 int max = (b + c) < 255 ? (b + c) : 255;
 if ((a >= min) && (a <= max))

The asker's code (in assembly):

 movl  %eax, %ecx
 movl  %ebx, %eax
 subl  %ecx, %eax
 movl  $0, %edx
 cmovs %edx, %eax
 cmpl  %eax, %r12d
 jl    L13
 leal  (%rcx,%rbx), %eax
 cmpl  $255, %eax
 movb  $-1, %dl
 cmovg %edx, %eax
 cmpl  %eax, %r12d
 jmp   L13

My code (in assembly):

 movl  %eax, %ecx
 movl  %ebx, %eax
 subl  %ecx, %eax
 movl  $0, %edx
 cmovs %edx, %eax
 cmpl  %eax, %r12d
 jl    L13
 leal  (%rcx,%rbx), %eax
 cmpl  $255, %eax
 movb  $-1, %dl
 cmovg %edx, %eax
 cmpl  %eax, %r12d
 jg    L13

NightCracker's code (in assembly):

 movl   %r12d, %edx
 subl   %ebx, %edx
 movl   %ebx, %ecx
 subl   %r12d, %ecx
 cmpl   %ebx, %r12d
 cmovle %ecx, %edx
 cmpl   %eax, %edx
 jg     L16
+2

Rarely does embedding a ternary operator inside another statement improve performance :)

If every single opcode matters, write the opcodes yourself: use assembler. Also consider using SIMD instructions if possible. The target platform matters too: ARM assembly loves this kind of comparison and has opcodes to speed up this sort of math.

+1
