Why would a recursive version of a function be faster than an iterative one in C?

I was checking the difference between two gradient descent implementations. My assumption was that, after compiler optimization, both versions of the algorithm would be equivalent.

To my surprise, the recursive version was significantly faster. I do not rule out an actual defect in either version, or even in the way I measure time. Can you give me some ideas?

This is my code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <time.h>
    #include <stdint.h>

    double f(double x)
    {
        return 2*x;
    }

    double descgrad(double xo, double xnew, double eps, double precision)
    {
        // printf("step ... x:%f Xp:%f, delta:%f\n", xo, xnew, fabs(xnew - xo));
        if (fabs(xnew - xo) < precision) {
            return xnew;
        } else {
            descgrad(xnew, xnew - eps*f(xnew), eps, precision);
        }
    }

    double descgraditer(double xo, double xnew, double eps, double precision)
    {
        double Xo = xo;
        double Xn = xnew;

        while (fabs(Xn - Xo) > precision) {
            //printf("step ... x:%f Xp:%f, delta:%f\n", Xo, Xn, fabs(Xn - Xo));
            Xo = Xn;
            Xn = Xo - eps * f(Xo);
        }
        return Xn;
    }

    int64_t timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
    {
        return ((timeA_p->tv_sec * 1000000000) + timeA_p->tv_nsec) -
               ((timeB_p->tv_sec * 1000000000) + timeB_p->tv_nsec);
    }

    int main()
    {
        struct timespec s1, e1, s2, e2;

        clock_gettime(CLOCK_MONOTONIC, &s1);
        printf("Minimum : %f\n", descgraditer(100, 99, 0.01, 0.00001));
        clock_gettime(CLOCK_MONOTONIC, &e1);

        clock_gettime(CLOCK_MONOTONIC, &s2);
        printf("Minimum : %f\n", descgrad(100, 99, 0.01, 0.00001));
        clock_gettime(CLOCK_MONOTONIC, &e2);

        uint64_t dif1 = timespecDiff(&e1, &s1) / 1000;
        uint64_t dif2 = timespecDiff(&e2, &s2) / 1000;

        printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n",
               dif1, dif2, ((double)((double)dif1/(double)dif2)));
        printf("End. \n");
    }

I am compiling with gcc 4.5.2 on Ubuntu 11.04, with the following options:

    gcc grad.c -O3 -lrt -o dg

The output of my code is:

    Minimum : 0.000487
    Minimum : 0.000487
    time_iter:127 ms, time_rec:19 ms, ratio (dif1/dif2) :6.68421
    End.

I read a thread that asked the same thing, about a recursive version of an algorithm being faster than an iterative one. The explanation there was that the recursive version used the stack while the other version used some heap-allocated vectors, and the heap accesses slowed down the iterative version. But in this case (as far as I can tell) I only use the stack in both cases.

Am I missing something? Anything obvious that I don't see? Is my way of measuring time wrong? Any ideas?

EDIT: The mystery was resolved in a comment. As @TonyK said, the initialization of printf slowed down the first run. Sorry I missed such an obvious thing.

BTW, the code compiles cleanly, without warnings. I do not think the return in "return descgrad(..." is necessary, since the stopping condition is reached either way.

+7
7 answers

I compiled and ran your code locally. Moving the printf outside the timed block makes each version run in ~5 ms every time.

So the central error in your timing is that you are measuring printf, a complex beast whose runtime overshadows the code you are actually trying to measure.

My main() function now looks like this:

    int main()
    {
        struct timespec s1, e1, s2, e2;
        double d = 0.0;

        clock_gettime(CLOCK_MONOTONIC, &s1);
        d = descgraditer(100, 99, 0.01, 0.00001);
        clock_gettime(CLOCK_MONOTONIC, &e1);
        printf("Minimum : %f\n", d);

        clock_gettime(CLOCK_MONOTONIC, &s2);
        d = descgrad(100, 99, 0.01, 0.00001);
        clock_gettime(CLOCK_MONOTONIC, &e2);
        printf("Minimum : %f\n", d);

        uint64_t dif1 = timespecDiff(&e1, &s1) / 1000;
        uint64_t dif2 = timespecDiff(&e2, &s2) / 1000;

        printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n",
               dif1, dif2, ((double)((double)dif1/(double)dif2)));
        printf("End. \n");
    }
+10

Am I measuring time incorrectly?

Yes. At the short time intervals you are measuring, the scheduler can have a huge impact on your program. You need to either make your test run much longer, to average out such differences, or measure with CLOCK_PROCESS_CPUTIME_ID instead, so that you measure the CPU time actually used by your process.
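For example, switching the question's code to per-process CPU time is a one-line change per call. A minimal sketch, reusing the timespecDiff helper from the question, with error checking omitted:

    struct timespec s, e;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &s);
    descgraditer(100, 99, 0.01, 0.00001);   /* code under test */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &e);

    /* CPU time consumed by this process, in nanoseconds */
    int64_t cpu_ns = timespecDiff(&e, &s);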

+5

First, your recursive step misses a return:

    double descgrad(double xo, double xnew, double eps, double precision)
    {
        if (fabs(xnew - xo) < precision)
            return xnew;
        else
            descgrad(xnew, xnew - eps*f(xnew), eps, precision);
    }

It should be:

    double descgrad(double xo, double xnew, double eps, double precision)
    {
        if (fabs(xnew - xo) < precision)
            return xnew;
        else
            return descgrad(xnew, xnew - eps*f(xnew), eps, precision);
    }

This omission makes the return value of descgrad undefined, so the compiler barely needs to generate code for it at all ;)
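Incidentally, the code only "compiles without warning" because no warnings were requested. Compiling with -Wall enables -Wreturn-type, which catches exactly this; the exact wording varies by GCC version, but the diagnostic is along these lines:

    $ gcc grad.c -O3 -Wall -lrt -o dg
    grad.c: In function 'descgrad':
    grad.c:<line>: warning: control reaches end of non-void function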

+3

For starters, you included printf in the region you were trying to measure. That is always a giant no-no, because it can, and most likely will, block your process while doing console output. In fact, making almost any system call can completely skew time measurements like these.

And secondly, as someone else mentioned, at time scales this short, scheduler interrupts can have a huge impact.

This is not ideal, but try this in your main(), and you will see that there is actually very little difference. As the loop count increases, the ratio approaches 1.0.

    #define LOOPCOUNT 100000

    int main()
    {
        struct timespec s1, e1, s2, e2;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &s1);
        for (i = 0; i < LOOPCOUNT; i++) {
            descgraditer(100, 99, 0.01, 0.00001);
        }
        clock_gettime(CLOCK_MONOTONIC, &e1);

        clock_gettime(CLOCK_MONOTONIC, &s2);
        for (i = 0; i < LOOPCOUNT; i++) {
            descgrad(100, 99, 0.01, 0.00001);
        }
        clock_gettime(CLOCK_MONOTONIC, &e2);

        uint64_t dif1 = timespecDiff(&e1, &s1) / 1000;
        uint64_t dif2 = timespecDiff(&e2, &s2) / 1000;

        printf("time_iter:%llu ms, time_rec:%llu ms, ratio (dif1/dif2) :%g\n",
               dif1, dif2, ((double)((double)dif1/(double)dif2)));
        printf("End. \n");
    }

EDIT: After looking at the disassembled output with objdump -dS, I noticed a few things. With -O3, the above code optimizes the function calls away completely. However, it still generates code for both functions, and neither of them is actually recursive.

Secondly, with -O0, so that the generated code actually is recursive, the recursive version is dramatically slower. My guess is that using the call stack forces the variables into memory, whereas the iterative version keeps them in registers and/or cache.

+2

The accepted answer is incorrect.

There is a difference in the execution time of the iterative and recursive functions, and the reason is the compiler optimization -foptimize-sibling-calls, enabled by -O3.

First, the code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <time.h>
    #include <stdint.h>

    double descgrad(double xo, double xnew, double eps, double precision)
    {
        if (fabs(xnew - xo) <= precision) {
            return xnew;
        } else {
            return descgrad(xnew, xnew - eps*2*xnew, eps, precision);
        }
    }

    double descgraditer(double xo, double xnew, double eps, double precision)
    {
        double Xo = xo;
        double Xn = xnew;

        while (fabs(Xn - Xo) > precision) {
            Xo = Xn;
            Xn = Xo - eps * 2*Xo;
        }
        return Xn;
    }

    int main()
    {
        time_t s1, e1, d1, s2, e2, d2;
        int i, iter = 10000000;
        double a1, a2;

        s1 = time(NULL);
        for (i = 0; i < iter; i++) {
            a1 = descgraditer(100, 99, 0.01, 0.00001);
        }
        e1 = time(NULL);
        d1 = difftime(e1, s1);

        s2 = time(NULL);
        for (i = 0; i < iter; i++) {
            a2 = descgrad(100, 99, 0.01, 0.00001);
        }
        e2 = time(NULL);
        d2 = difftime(e2, s2);

        printf("time_iter: %ds, time_rec: %ds, ratio (iter/rec): %f\n",
               (int)d1, (int)d2, (double)d1 / d2);
        printf("return values: %f, %f\n", a1, a2);
    }

The previous posters were correct in pointing out that you need to repeat many iterations to average out environmental interference. Given that, I discarded your timespecDiff function in favor of difftime from time.h on time_t data, since over many iterations anything finer than a second is meaningless. I also removed the printfs from the timed region.

I also fixed a bug in the recursive implementation. The original code's if-statement checked fabs(xnew - xo) < precision, which is incorrect (or at least different from the iterative implementation). The iterative version loops while fabs() > precision, so the recursive function should stop recursing once fabs() <= precision. Adding iteration counters to both functions confirms that this fix makes the two functions logically equivalent.

Compile and run with -O3:

    $ gcc test.c -O3 -lrt -o dg
    $ ./dg
    time_iter: 34 s, time_rec: 0 s, ratio (iter/rec): inf
    return values: 0.000487, 0.000487

Compile and run without -O3:

    $ gcc test.c -lrt -o dg
    $ ./dg
    time_iter: 54 s, time_rec: 90 s, ratio (iter/rec): 0.600000
    return values: 0.000487, 0.000487

Without optimization, iteration performs BETTER than recursion.

With -O3, however, the recursion runs ten million iterations in under a second. The reason is that -O3 adds -foptimize-sibling-calls, which optimizes sibling and tail calls, and a tail call is exactly what your recursive function uses.
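Conceptually, the optimization turns the tail call into a jump that reuses the current stack frame, so the optimized descgrad behaves like this hand-written loop (a sketch of the transformation, not actual compiler output):

    double descgrad_as_loop(double xo, double xnew, double eps, double precision)
    {
        /* the tail call becomes a jump back to the top: no new frame per step */
        while (fabs(xnew - xo) > precision) {
            double next = xnew - eps*2*xnew;
            xo = xnew;
            xnew = next;
        }
        return xnew;
    }

Which is, unsurprisingly, essentially the iterative version; that is why the two end up comparable once the optimization kicks in.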

To be sure, I ran with all of the -O3 optimizations EXCEPT -foptimize-sibling-calls:

    $ gcc test.c -lrt -o dg -fcprop-registers -fdefer-pop -fdelayed-branch \
        -fguess-branch-probability -fif-conversion2 -fif-conversion \
        -fipa-pure-const -fipa-reference -fmerge-constants -ftree-ccp \
        -ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts \
        -ftree-dse -ftree-fre -ftree-sra -ftree-ter -funit-at-a-time \
        -fthread-jumps -falign-functions -falign-jumps -falign-loops \
        -falign-labels -fcaller-saves -fcrossjumping -fcse-follow-jumps \
        -fcse-skip-blocks -fdelete-null-pointer-checks -fexpensive-optimizations \
        -fgcse -fgcse-lm -fpeephole2 -fregmove -freorder-blocks \
        -freorder-functions -frerun-cse-after-loop -fsched-interblock \
        -fsched-spec -fschedule-insns -fschedule-insns2 -fstrict-aliasing \
        -ftree-pre -ftree-vrp -finline-functions -funswitch-loops \
        -fgcse-after-reload -ftree-vectorize
    $ ./dg
    time_iter: 55 s, time_rec: 89 s, ratio (iter/rec): 0.617978
    return values: 0.000487, 0.000487

Without tail call optimization, recursion performs worse than iteration, just as when compiling with no optimization at all. Read about the individual optimization passes in the GCC documentation.
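If you want to see which passes a given level enables on your own GCC, the compiler can report it itself; the -Q --help=optimizers query should be available on the GCC 4.x series used here:

    $ gcc -Q -O3 --help=optimizers | grep enabled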

EDIT:

As further validation, I updated my code to use the return values. In addition, I initialized two counters to 0 and incremented one on each recursion and the other on each iteration, to verify that both functions do the same work:

    int a = 0;
    int b = 0;

    double descgrad(double xo, double xnew, double eps, double precision)
    {
        if (fabs(xnew - xo) <= precision) {
            return xnew;
        } else {
            a++;
            return descgrad(xnew, xnew - eps*2*xnew, eps, precision);
        }
    }

    double descgraditer(double xo, double xnew, double eps, double precision)
    {
        double Xo = xo;
        double Xn = xnew;

        while (fabs(Xn - Xo) > precision) {
            b++;
            Xo = Xn;
            Xn = Xo - eps * 2*Xo;
        }
        return Xn;
    }

    int main()
    {
        time_t s1, e1, d1, s2, e2, d2;
        int i, iter = 10000000;
        double a1, a2;

        s1 = time(NULL);
        for (i = 0; i < iter; i++) {
            a1 = descgraditer(100, 99, 0.01, 0.00001);
        }
        e1 = time(NULL);
        d1 = difftime(e1, s1);

        s2 = time(NULL);
        for (i = 0; i < iter; i++) {
            a2 = descgrad(100, 99, 0.01, 0.00001);
        }
        e2 = time(NULL);
        d2 = difftime(e2, s2);

        printf("time_iter: %ds, time_rec: %ds, ratio (iter/rec): %f\n",
               (int)d1, (int)d2, (double)d1 / d2);
        printf("return values: %f, %f\n", a1, a2);
        printf("number of recurs/iters: %d, %d\n", a, b);
    }

Output:

    $ gcc optimization.c -O3 -lrt -o dg
    $ ./dg
    time_iter: 41 s, time_rec: 24 s, ratio (iter/rec): 1.708333
    return values: 0.000487, 0.000487
    number of recurs/iters: 1755032704, 1755032704

The return values are the same, and the counts are the same.

It is also interesting to note that fetching and incrementing the counters has a significant effect on the tail call optimization; nevertheless, recursion still outperforms iteration.

+2

First, clock_gettime appears to measure wall clock time, not execution time. Second, the actual time you are measuring is the runtime of printf, not the runtime of your function. And third, the first time you call printf, it is not yet paged into memory, so it has to be paged in, involving significant disk I/O. Run the tests in the reverse order and the results will invert as well.

If you want to get meaningful measurements (a sketch putting these points together follows the list), you must make sure that:

  • only the code you want to measure is inside the measurement loops, or at least any additional code is minimal compared to what you are measuring,
  • you do something with the results so that the compiler cannot optimize the code away (not a problem in your tests),
  • you run the code to be measured a large number of times and take the average,
  • you measure CPU time, not wall clock time, and
  • you make sure everything is paged in before starting the measurement.
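A minimal harness along these lines might look as follows. This is a sketch, not a definitive implementation: the loop count is arbitrary, descgraditer is assumed to be linked in from the question's code, error checking is omitted, and it is compiled with -lrt as in the question.

    #include <stdio.h>
    #include <time.h>
    #include <stdint.h>

    #define RUNS 1000000

    double descgraditer(double xo, double xnew, double eps, double precision);

    int main(void)
    {
        struct timespec s, e;
        double sink = 0.0;   /* consume results so the compiler cannot drop the calls */
        int i;

        /* warm-up pass: fault code and data into memory before timing */
        sink += descgraditer(100, 99, 0.01, 0.00001);

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &s);   /* CPU time, not wall clock */
        for (i = 0; i < RUNS; i++)
            sink += descgraditer(100, 99, 0.01, 0.00001);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &e);

        int64_t ns = (e.tv_sec - s.tv_sec) * 1000000000LL + (e.tv_nsec - s.tv_nsec);
        printf("avg per call: %f ns (sink=%f)\n", (double)ns / RUNS, sink);
        return 0;
    }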
+1

In many cases, cache misses on modern hardware are the limiting factor in performance for small code blocks. A recursive implementation is less likely to create cache misses on the instruction path.
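That hypothesis can be checked rather than guessed at: on Linux, hardware performance counters will show it directly, assuming the perf tool is installed and the binary is named dg as in the question:

    $ perf stat -e instructions,cache-misses ./dg

Comparing the counts for the two variants would confirm or refute the cache explanation.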

0
