Branch prediction for free?

I just stumbled upon this, and I am very curious whether modern processors (current ones, possibly mobile/embedded) actually have no branching cost in the situation below.

1. Say we have this:

 x += a;     // assume x and a are both declared earlier as simple ints
 if (flag)
     do A    // assume A is not the same as B
 else
     do B    // and of course B is different from A

2. Compared to this:

 if (flag) {
     x += a
     do A
 } else {
     x += a
     do B
 }

Assuming that A and B are completely different as far as the pipeline is concerned (fetch, decode, execute, etc.):

  • Is the second approach faster?

  • Are processors smart enough to notice that, no matter what the flag is, the next instruction is the same (so they don't have to throw away pipeline stages on a branch misprediction)?

Note:

In the first case, the CPU has no option but to discard the first few pipeline stages of do A or do B if a branch misprediction occurs, because they are different. I see the second example as a kind of delayed branch: "I am going to check this flag, but even while I don't know its value I can continue with the following instruction, because it is the same no matter what the flag is. I already have the following instruction, and I can safely use it."

EDIT:
I have done some research and have some results. How would you explain this behavior? Sorry for my earlier edit; as far as I can tell I had some caching problems, so these results and code samples should be more accurate.

Here is the code compiled with gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) using -O3.

Case 1

 #include <stdio.h>

 extern int * cache;
 extern bool * b;
 extern int * x;
 extern int * a;
 extern unsigned long * loop;
 extern void A();
 extern void B();

 int main()
 {
     for (unsigned long i = 0; i < *loop; ++i) {
         ++*cache;
         *x += *a;
         if (*b) {
             A();
         } else {
             B();
         }
     }
     delete b;
     delete x;
     delete a;
     delete loop;
     delete cache;
     return 0;
 }

 int * cache = new int(0);
 bool * b = new bool(true);
 int * x = new int(0);
 int * a = new int(0);
 unsigned long * loop = new unsigned long(0x0ffffffe);

 void A() { --*x; *b = false; }
 void B() { ++*x; *b = true; }

Case 2

 #include <stdio.h>

 extern int * cache;
 extern bool * b;
 extern int * x;
 extern int * a;
 extern unsigned long * loop;
 extern void A();
 extern void B();

 int main()
 {
     for (unsigned long i = 0; i < *loop; ++i) {
         ++*cache;
         if (*b) {
             *x += *a;
             A();
         } else {
             *x += *a;
             B();
         }
     }
     delete b;
     delete x;
     delete a;
     delete loop;
     delete cache;
     return 0;
 }

 int * cache = new int(0);
 bool * b = new bool(true);
 int * x = new int(0);
 int * a = new int(0);
 unsigned long * loop = new unsigned long(0x0ffffffe);

 void A() { --*x; *b = false; }
 void B() { ++*x; *b = true; }
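For reference, a sketch of how the two cases can be built and timed (the source file names 1.cc and 2.cc are my assumption; the binary names match the prompts below):

 g++ -std=c++11 1.cc -o 1    # no -O3, matching the timings reported below
 g++ -std=c++11 2.cc -o 2
 time ./1
 time ./2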

With -O3 there is only a barely perceptible difference between the two approaches, but without -O3 the second case runs a little faster, at least on my machine. I tested without -O3 and with loop = 0xfffffffe.
Best times:
alin@ubuntu:~/Desktop$ time ./1

real 0m20.231s
user 0m20.224s
sys 0m0.020s

alin@ubuntu:~/Desktop$ time ./2

real 0m19.932s
user 0m19.890s
sys 0m0.060s

c++ c branch-prediction pipeline
2 answers

There are two parts:

First, does the compiler optimize this?

Run the experiment:

test.cc

 #include <random>
 #include "test2.h"

 int main()
 {
     std::default_random_engine e;
     std::uniform_int_distribution<int> d(0, 1);
     int flag = d(e);

     int x = 0;
     int a = 1;

     if (flag) {
         x += a;
         doA(x);
         return x;
     } else {
         x += a;
         doB(x);
         return x;
     }
 }

test2.h

 void doA(int& x);
 void doB(int& x);

test2.cc

 void doA(int& x) {}
 void doB(int& x) {}

test2.cc and test2.h exist solely to prevent the compiler from optimizing everything away. Since the functions live in another translation unit, the compiler cannot be sure they have no side effects.
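(An aside beyond the original answer: this barrier only holds without link-time optimization. With something like GCC's -flto the compiler can see across translation units and could optimize the empty calls away, so LTO must stay off for this experiment:)

 g++ -std=c++11 -O3 -flto test.cc test2.cc -o test   # would defeat the trick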

Now we generate the assembly:

 gcc -std=c++11 -S test.cc 

And here is the interesting part of the assembly:

     call _ZNSt24uniform_int_distributionIiEclISt26linear_congruential_engineImLm16807ELm0ELm2147483647EEEEiRT_
     movl %eax, -40(%rbp)   # <- setting flag
     movl $0, -44(%rbp)     # <- setting x
     movl $1, -36(%rbp)     # <- setting a
     cmpl $0, -40(%rbp)     # <- first part of if (flag)
     je .L2                 # <- second part of if (flag)
     movl -44(%rbp), %edx   # <- setting up x
     movl -36(%rbp), %eax   # <- setting up a
     addl %edx, %eax        # <- adding x and a
     movl %eax, -44(%rbp)   # <- assigning back to x
     leaq -44(%rbp), %rax   # <- grabbing the address of x
     movq %rax, %rdi        # <- bookkeeping for the function call
     call _Z3doARi          # <- the call to doA
     movl -44(%rbp), %eax
     jmp .L4
 .L2:
     movl -44(%rbp), %edx   # <- setting up x
     movl -36(%rbp), %eax   # <- setting up a
     addl %edx, %eax        # <- perform the addition
     movl %eax, -44(%rbp)   # <- move it back to x
     leaq -44(%rbp), %rax   # <- and so on
     movq %rax, %rdi
     call _Z3doBRi
     movl -44(%rbp), %eax
 .L4:

So we see that the compiler did not optimize it. But then, we did not ask it to.

 g++ -std=c++11 -S -O3 test.cc 

and now the interesting part of the assembly:

 main:
 .LFB4729:
     .cfi_startproc
     subq $56, %rsp
     .cfi_def_cfa_offset 64
     leaq 32(%rsp), %rdx
     leaq 16(%rsp), %rsi
     movq $1, 16(%rsp)
     movq %fs:40, %rax
     movq %rax, 40(%rsp)
     xorl %eax, %eax
     movq %rdx, %rdi
     movl $0, 32(%rsp)
     movl $1, 36(%rsp)
     call _ZNSt24uniform_int_distributionIiEclISt26linear_congruential_engineImLm16807ELm0ELm2147483647EEEEiRT_RKNS0_10param_typeE
     testl %eax, %eax
     movl $1, 12(%rsp)
     leaq 12(%rsp), %rdi
     jne .L83
     call _Z3doBRi
     movl 12(%rsp), %eax
 .L80:
     movq 40(%rsp), %rcx
     xorq %fs:40, %rcx
     jne .L84
     addq $56, %rsp
     .cfi_remember_state
     .cfi_def_cfa_offset 8
     ret
 .L83:
     .cfi_restore_state
     call _Z3doARi
     movl 12(%rsp), %eax
     jmp .L80

This is a bit beyond my ability to clearly show a one-to-one relationship between the assembly and the code, but you can tell from the calls to doA and doB that the setup is common to both paths and performed outside the if statement (above the jne .L83 line). So yes, compilers do perform this optimization.
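To make that concrete, here is my reading of the -O3 assembly expressed back as C++ (a hand-written sketch, not compiler output; names match test.cc above):

 // Roughly what -O3 produced:
 int flag = d(e);    // the call to the distribution
 int x = 1;          // x += a folded to a constant and hoisted out of both branches
 if (flag)
     doA(x);         // only the function call differs between the two paths
 else
     doB(x);
 return x;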

Part 2:

Second, can the processor perform this optimization when given the first form of the code?

Honestly, I don't know how to test this, so I don't know. I would rate it as plausible, given that out-of-order and speculative execution exist. But the proof of the pudding is in the eating, and I have no way to taste this pudding, so I am reluctant to guess one way or the other.
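(One practical approach the answer does not pursue: hardware performance counters. On Linux, perf can count branches and mispredictions directly, which would at least show whether the two versions of the question's code mispredict differently:)

 perf stat -e branches,branch-misses ./1
 perf stat -e branches,branch-misses ./2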


Back in the day, CPUs explicitly supported something like this: after a branch instruction, the next instruction was always executed regardless of whether the branch was taken (look up "branch delay slot").

I am fairly sure modern processors simply flush the entire pipeline on a branch misprediction. There is no point in attempting the optimization you propose at run time when the compiler can easily do it at compile time.
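As an aside going beyond this answer, the cost of those pipeline flushes is easy to observe. Below is a minimal sketch of my own (not from the answer; file names and constants are made up) that times the same branch over random versus sorted data; when the outcome is unpredictable, mispredictions make the loop much slower. Note that an aggressive optimizer may turn the branch into a conditional move and hide the effect, so it is best compiled without -O3.

 #include <algorithm>
 #include <chrono>
 #include <cstdio>
 #include <random>
 #include <vector>

 // Sum only the elements >= 128; the if below is the branch being measured.
 static long sum_big(const std::vector<int>& v) {
     long sum = 0;
     for (int n : v)
         if (n >= 128)   // predictable on sorted input, random on shuffled input
             sum += n;
     return sum;
 }

 int main() {
     std::vector<int> v(1 << 22);
     std::mt19937 gen(42);
     std::uniform_int_distribution<int> dist(0, 255);
     for (int& n : v) n = dist(gen);

     auto time_it = [&](const char* label) {
         auto t0 = std::chrono::steady_clock::now();
         long s = sum_big(v);
         auto t1 = std::chrono::steady_clock::now();
         long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
         std::printf("%-8s sum=%ld time=%lld us\n", label, s, us);
     };

     time_it("random");              // branch outcome is random: many mispredictions
     std::sort(v.begin(), v.end());
     time_it("sorted");              // branch outcome is predictable: few mispredictions
     return 0;
 }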

