Why is GCC generating a faster program than Clang in this Fibonacci recursive code?

Question

Why is GCC generating a faster program than Clang in this Fibonacci recursive code?

Here is the code I tested:

#include <iostream> #include <chrono> using namespace std; #define CHRONO_NOW chrono::high_resolution_clock::now() #define CHRONO_DURATION(first,last) chrono::duration_cast<chrono::duration<double>>(last-first).count() int fib(int n) { if (n<2) return n; return fib(n-1) + fib(n-2); } int main() { auto t0 = CHRONO_NOW; cout << fib(45) << endl; cout << CHRONO_DURATION(t0, CHRONO_NOW) << endl; return 0; }

Of course, there are much faster ways to calculate Fibonacci numbers, but this is a good little stress test that focuses on recursive function calls. There is nothing else in the code except using a chronograph to measure time.

At first, I ran the test several times in Xcode on OS X (so I clanged) using the -O3 optimization. It took about 9 seconds.

Then I compiled the same code using gcc (g ++) in Ubuntu (again using -O3), and it took about 6.3 seconds to run this version! Also, I ran Ubuntu inside VirtualBox on my Mac, which could negatively impact performance, if at all.

So, let's go:

Clank on OS X → ~ 9 seconds
gcc in Ubuntu in VirtualBox → ~ 6.3 s.

I know that these are completely different compilers, so they do things differently, but all the tests I saw with gcc and clang showed a much smaller difference, and in some cases the difference was the opposite (clang is faster).

So is there any logical explanation for why in this particular example gcc hits the frog for miles

+7

c ++ compiler-optimization gcc x86 clang

adam10603 Mar 21 '15 at 18:08

source share

2 answers

GCC 4.9.2 in the compiler explorer does the loop deployment and includes many function calls, while Clang 3.5.1 calls fib twice every iteration without even a tail call optimization as shown below

 fib(int): # @fib(int) push rbp push rbx push rax mov ebx, edi cmp ebx, 2 jge .LBB0_1 mov eax, ebx jmp .LBB0_3 .LBB0_1: lea edi, dword ptr [rbx - 1] call fib(int) # fib(ebx - 1) mov ebp, eax add ebx, -2 mov edi, ebx call fib(int) # fib(ebx - 2) add eax, ebp .LBB0_3: add rsp, 8 pop rbx pop rbp ret

The GCC version is more than 10 times longer: only one fib call and 20+ labels to embed the call , which also means that the last call was optimized for jmp or GCC converted part of the recursion to iteration (since it allocates a large array to store intermediate values)

I also introduced ICC in perspective, and surprisingly it contains 10 call instructions inside fib , and it also calls inline fib 9 times inside main , but it does not convert recursive code to iterative

Here the compiler displays for comparison

Please note that you can change the code to make the output easier to read.

 int fib(int n) { if (n<2) return n; int t = fib(n-1); return t + fib(n-2); }

Now the compiler explorer will highlight which line of the source code the instruction in the output of the assembly matches with different colors, and you can easily see how two calls are made. The string return t + fib(n-2) compiled by GCC for

 .L3: mov eax, DWORD PTR [rsp+112] # n, %sfp add edx, DWORD PTR [rsp+64] # D.35656, %sfp add DWORD PTR [rsp+60], edx # %sfp, D.35656 sub DWORD PTR [rsp+104], 2 # %sfp,

+10

phuclv Mar 21 '15 at 18:29

source share

juhist · Accepted Answer · 2015-03-21T18:27:18+0000

I would not say that gcc hits for miles. In my opinion, the difference in performance (6.3 seconds versus 9 seconds) is quite small. On my FreeBSD system, clang takes 26.12 seconds, and gcc takes 10.55 seconds.

However, the debugging method is to use g ++ -S and clang ++ -S to get the assembly.

I tested this on my FreeBSD system. Assembly language files are too long to publish here, but it seems that gcc performs several levels of embedding in the fibonacci calculation function (there were 20 calls to fib ()!), Whereas clang just calls fib (n-1) and fib (n- 2) without investment levels.

By the way, my gcc version was 4.2.1 20070831 fixed [FreeBSD], and the clang version was 3.1 (branch / release_31 156863) 20120523. These were the versions that came with the FreeBSD 9.1-RELEAESE base system. The processor is a dual-core AMD Turion II Neo N40L processor (1497.54 MHz).

Why is GCC generating a faster program than Clang in this Fibonacci recursive code?

More articles: