An empty loop is slower than a nonempty one in C

Question

An empty loop is slower than a nonempty one in C

When trying to find out how long the line of C code was running, I noticed this strange thing:

int main (char argc, char * argv[]) { time_t begin, end; uint64_t i; double total_time, free_time; int A = 1; int B = 1; begin = clock(); for (i = 0; i<(1<<31)-1; i++); end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); begin = clock(); for (i = 0; i<(1<<31)-1; i++) { A += B%2; } end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); return(0); }

What when performing impressions:

 5.873425 4.826874

Why does an empty loop use more time than the second in which there is a command? Of course, I tried many options, but each time, an empty loop takes longer than one with one instruction inside.

Please note that I tried to arrange the order of the loops and add the warm-up code, and it did not change my problem at all.

I use code blocks as an IDE with the GNU gcc compiler, linux ubuntu 14.04 and I have 2.4 GHz Quadcore intel i5 (I tried to run the program on one core, this does not change the result).

+64

performance c loops

Celerio Jul 31 '14 at 20:07

source share

4 answers

Assuming your code uses a 32-bit integer int (which your system probably does), nothing can be determined from your code. Instead, it demonstrates undefined behavior.

 foo.c:5:5: error: first parameter of 'main' (argument count) must be of type 'int' int main (char argc, char * argv[]) { ^ foo.c:13:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow] for (i = 0; i<(1<<31)-1; i++); ^ foo.c:19:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow] for (i = 0; i<(1<<31)-1; i++) { ^

Let's try to fix this:

 #include <stdint.h> #include <stdio.h> #include <time.h> #include <limits.h> int main (int argc, char * argv[]) { time_t begin, end; uint64_t i; double total_time, free_time; int A = 1; int B = 1; begin = clock(); for (i = 0; i<INT_MAX; i++); end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); begin = clock(); for (i = 0; i<INT_MAX; i++) { A += B%2; } end = clock(); free_time = (double)(end-begin)/CLOCKS_PER_SEC; printf("%f\n", free_time); return(0); }

Now let's look at the assembly of this code. Personally, I think the internal LLVM build is very readable, so I'm going to show it. I will create it by doing:

 clang -O3 foo.c -S -emit-llvm -std=gnu99

Here is the corresponding part of the output (main function):

 define i32 @main(i32 %argc, i8** nocapture readnone %argv) #0 { %1 = tail call i64 @"\01_clock"() #3 %2 = tail call i64 @"\01_clock"() #3 %3 = sub nsw i64 %2, %1 %4 = sitofp i64 %3 to double %5 = fdiv double %4, 1.000000e+06 %6 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %5) #3 %7 = tail call i64 @"\01_clock"() #3 %8 = tail call i64 @"\01_clock"() #3 %9 = sub nsw i64 %8, %7 %10 = sitofp i64 %9 to double %11 = fdiv double %10, 1.000000e+06 %12 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %11) #3 ret i32 0 }

Note that there is no operation between calls to clock() for any case . So they are both compiled for exactly the same .

+78

Bill Lynch Jul 31 '14 at 20:18

source share

This answer assumes that you have already understood and considered excellent points regarding the undefined behavior that sharth does in his answer . It also points out tricks that the compiler can play on your code. You must take steps to make sure that the compiler does not recognize the whole loop as worthless. For example, changing an iterator declaration to volatile uint64_t i; prevent loop deletion, and volatile int A; ensures that the second cycle will actually work more than the first. But even if you do all this, you can still find that:

Code later in the program may execute faster than the previous code.

The clock() library function could cause icache to skip after reading the timer and before returning. This will cause some additional time in the first measured interval. (For later calls, the code is already in the cache). However, this effect will be tiny, of course, too small for clock() to measure, even if it was a page error all the way to the disk. Random context switches can add to a time interval.

More importantly, you have an i5 processor that has a dynamic beat. When your program starts running, the clock speed is likely to be low because the processor is idle. Just starting the program makes the processor no longer work, so after a short delay the clock speed will increase. The relationship between the idle speed and the TurboBoosted CPU frequency can be significant. (On my Haswell i5-4200U ultrabook, the previous multiplier is 8 and the second is 26, which makes the startup code less than 30% as fast as the later code! Calibrated loops to implement delays are a terrible idea on modern computers!)

Including the warm-up phase (repeating the test and discarding the first result) for a more accurate time, not only for managed frameworks with JIT compilers!

+30

Ben Voigt Jul 31 '14 at 20:30

source share

I can reproduce this with GCC 4.8.2-19ubuntu1 without optimization:

 $ ./a.out 4.780179 3.762356

Here is an empty loop:

 0x00000000004005af <+50>: addq $0x1,-0x20(%rbp) 0x00000000004005b4 <+55>: cmpq $0x7fffffff,-0x20(%rbp) 0x00000000004005bc <+63>: jb 0x4005af <main+50>

And here is nonempty:

 0x000000000040061a <+157>: mov -0x24(%rbp),%eax 0x000000000040061d <+160>: cltd 0x000000000040061e <+161>: shr $0x1f,%edx 0x0000000000400621 <+164>: add %edx,%eax 0x0000000000400623 <+166>: and $0x1,%eax 0x0000000000400626 <+169>: sub %edx,%eax 0x0000000000400628 <+171>: add %eax,-0x28(%rbp) 0x000000000040062b <+174>: addq $0x1,-0x20(%rbp) 0x0000000000400630 <+179>: cmpq $0x7fffffff,-0x20(%rbp) 0x0000000000400638 <+187>: jb 0x40061a <main+157>

Insert a nop into an empty loop:

 0x00000000004005af <+50>: nop 0x00000000004005b0 <+51>: addq $0x1,-0x20(%rbp) 0x00000000004005b5 <+56>: cmpq $0x7fffffff,-0x20(%rbp) 0x00000000004005bd <+64>: jb 0x4005af <main+50>

Now they work equally fast:

 $ ./a.out 3.846031 3.705035

I suppose this shows the importance of alignment, but I'm afraid I can't say exactly how: |

+27

that other guy Jul 31. '14 at 21:17

source share

gnasher729 · Accepted Answer · 2014-08-01 09:59

The fact is that modern processors are complex. All completed instructions will interact with each other in complex and interesting ways. Thanks for the "other person" for sending the code.

Both OP and “this other guy” seem to have found that a short cycle takes 11 cycles and a long cycle takes 9 cycles. For a long cycle, 9 cycles is a lot of time, although there are many operations. There should be some stall for a short loop, caused by the fact that it was so short, and just adding nop makes the loop long enough to avoid a break.

One thing that arises if we look at the code:

 0x00000000004005af <+50>: addq $0x1,-0x20(%rbp) 0x00000000004005b4 <+55>: cmpq $0x7fffffff,-0x20(%rbp) 0x00000000004005bc <+63>: jb 0x4005af <main+50>

We read i and write it back ( addq ). We read it again and compare ( cmpq ). And then we go in cycles. But the loop uses branch prediction. Therefore, while addq is addq , the processor is not really sure that it is allowed to write on i (since branch prediction may be wrong).

Then we compare with i . The processor will try to avoid reading i from memory, because reading takes a lot of time. Instead, a certain amount of hardware will remember that we just wrote i , adding it to it, and instead of reading i , the cmpq command cmpq data from the storage instruction. Unfortunately, we are not sure if this actually happened in i or not! Thus, a stall could be introduced here.

The problem here is that the conditional jump, addq , which leads to conditional storage, and cmpq , which are not sure where to get the data, are very close to each other. They are unusually close to each other. Maybe they are so close to each other that at this moment the processor cannot understand whether to take i from the store's instructions or read it from memory. And it reads it from a memory that is slower because it has to wait for the store to finish. And adding just one nop gives the processor enough time.

Usually you think that there is RAM, and there is a cache. On a modern Intel processor, read memory can be read from (from the slowest to the fastest):

Memory (RAM)
L3 cache (optional)
L2 cache
L1 cache
Previous store instruction, which has not yet been written to the L1 cache.

What the processor does internally in a short, slow cycle:

Read i from cache L1
Add 1 to i
Write i to cache L1
Wait until i is written to cache L1
Read i from cache L1
Compare i with INT_MAX
Refer to (1) if it is less.

In a long, fast cycle, the processor performs:

Many things
Read i from cache L1
Add 1 to i
Make a store instruction that will write i to L1 cache
Read i directly from the store instruction without touching the L1 cache
Compare i with INT_MAX
Refer to (1) if it is less.

An empty loop is slower than a nonempty one in C

More articles: