What GCC optimization is (or is not) happening in this for loop?

Using gcc 4.6 with -O3, I timed the following four programs with a simple time command:

#include <iostream>

int main(int argc, char* argv[])
{
    double val = 1.0;
    unsigned int numIterations = 1e7;
    for (unsigned int ii = 0; ii < numIterations; ++ii) {
        val *= 0.999;
    }
    std::cout << val << std::endl;
}

Case 1 runs in 0.09 seconds

#include <iostream>

int main(int argc, char* argv[])
{
    double val = 1.0;
    unsigned int numIterations = 1e8;
    for (unsigned int ii = 0; ii < numIterations; ++ii) {
        val *= 0.999;
    }
    std::cout << val << std::endl;
}

Case 2 runs in 17.6 seconds

int main(int argc, char* argv[])
{
    double val = 1.0;
    unsigned int numIterations = 1e8;
    for (unsigned int ii = 0; ii < numIterations; ++ii) {
        val *= 0.999;
    }
}

Case 3 runs in 0.8 seconds

#include <iostream>

int main(int argc, char* argv[])
{
    double val = 1.0;
    unsigned int numIterations = 1e8;
    for (unsigned int ii = 0; ii < numIterations; ++ii) {
        val *= 0.999999;
    }
    std::cout << val << std::endl;
}

Case 4 runs in 0.8 seconds

My question is: why is the second case so much slower than all the others? Case 3 shows that removing the cout brings the runtime back to what I would expect, and case 4 shows that changing the multiplier also drastically reduces the execution time. What optimization is or is not occurring in case 2, and why?

Update:

When I originally ran these tests there was no separate numIterations variable; the value was hard-coded into the for loop. In general, hard-coding the value made things slower than in the cases given here. This was especially true for case 3, which ran almost instantly with the numIterations variable as shown above, suggesting James McNellis is right about the loop being optimized away entirely. I'm not sure why hard-coding 1e8 into the for loop prevents the loop removal in case 3 or slows down the other cases, but the basic premise that case 2 is dramatically slower holds all the more.

The differences in the assembly output between the cases are:

Case 2 vs. Case 1:

    movl $100000000, 16(%esp)

vs.

    movl $10000000, 16(%esp)

Case 2 vs. Case 4:

    .long -652835029
    .long 1072691150

vs.

    .long -417264663
    .long 1072693245

+7

4 answers

Rene Richter was on the right track regarding the underflow limit. The smallest positive normalized double is about 2.2e-308. With f(n) = 0.999**n, that limit is reached after about 708,148 iterations. The remaining iterations are stuck doing denormalized (subnormal) arithmetic.

This explains why 100 million iterations take a bit more than 10 times as long as 10 million. The first 700,000 or so are done by the floating-point hardware. Once you hit denormalized numbers, the floating-point hardware punts and the multiplication is done in software, which is far slower.

Note that this would not be the case if the result were computed exactly as 0.999**n: eventually the product would underflow all the way to zero, and from that point on the multiplications would be handled by the floating-point hardware again. That is not what happens, because 0.999 * (the smallest denormalized number) rounds back to the smallest denormalized number. The running product bottoms out there and never reaches zero.
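A minimal sketch (my own illustration, not from the original answer) that checks both claims: it finds the iteration at which the product first becomes subnormal, and verifies that 0.999 times the smallest denormal rounds back to itself.

#include <iostream>
#include <limits>

int main()
{
    const double normMin   = std::numeric_limits<double>::min();        // smallest normalized double, ~2.2e-308
    const double denormMin = std::numeric_limits<double>::denorm_min(); // smallest denormal, ~4.9e-324

    double val = 1.0;
    unsigned long n = 0;
    while (val >= normMin) {   // multiply until the running product becomes subnormal
        val *= 0.999;          // note: assumes plain double (SSE2) arithmetic; x87 excess precision changes the count
        ++n;
    }
    std::cout << "product becomes subnormal after " << n << " iterations\n";

    // 0.999 * denorm_min rounds back to denorm_min, so the product never reaches zero
    std::cout << "0.999 * denorm_min == denorm_min ? "
              << (0.999 * denormMin == denormMin) << std::endl;
}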

So let's change the multiplier. A multiplier of 0.999999 keeps the running product within the range of normalized numbers for about 708 million iterations. Taking advantage of this:

Case A : #iterations = 1e7, multiplier=0.999,    time=0.573692 sec
Case B : #iterations = 1e8, multiplier=0.999,    time=6.10548 sec
Case C : #iterations = 1e7, multiplier=0.999999, time=0.018867 sec
Case D : #iterations = 1e8, multiplier=0.999999, time=0.189375 sec

Here you can easily see how much faster floating point hardware is than software emulation.
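For reference, a quick closed-form check of the iteration counts mentioned above (a sketch of my own, not one of the timed cases): solve m^n = DBL_MIN for n for each multiplier.

#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    const double limit = std::numeric_limits<double>::min();  // smallest normalized double
    const double multipliers[] = { 0.999, 0.999999 };
    for (int i = 0; i < 2; ++i) {
        // solve m^n = limit  =>  n = log(limit) / log(m)
        std::cout << "multiplier " << multipliers[i] << ": normalized for about "
                  << std::log(limit) / std::log(multipliers[i]) << " iterations\n";
    }
}

This prints roughly 708 thousand iterations for 0.999 and roughly 708 million for 0.999999.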

+8

Compile with the -S option and look at the generated assembler output (files named *.s).

Edit:

In program 3, the loop is removed entirely because its result is not used.

For cases 1, 2 and 4, let's do some math. The base-10 logarithm of the result in case 1 is 1e7 * log10(0.999) ≈ -4345. For case 2 we get 1e8 * log10(0.999) ≈ -43451. For case 4 it is 1e8 * log10(0.999999) ≈ -43. The result itself is pow(10, logarithm).

The floating-point unit inside x86/x64 processors uses an 80-bit extended format internally, so intermediate results can remain normalized down to roughly 1e-4932; only when the number becomes smaller than about 1.9e-4951 do we get floating-point underflow, as @James Kanze pointed out. Of the cases above, that happens only in case 2. I don't know why it takes so much longer than a normalized computation; maybe someone can explain.
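To make that concrete, here is a small sketch (my addition; it assumes gcc on x86/x64, where long double is the 80-bit extended format) that prints the predicted magnitude of the final product for each case next to the double and long double limits:

#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    const double iterations[]  = { 1e7, 1e8, 1e8 };          // cases 1, 2, 4
    const double multipliers[] = { 0.999, 0.999, 0.999999 };
    for (int i = 0; i < 3; ++i) {
        // log10 of the exact mathematical result m^n
        std::cout << "n=" << iterations[i] << ", m=" << multipliers[i]
                  << ": log10(result) = " << iterations[i] * std::log10(multipliers[i]) << '\n';
    }
    std::cout << "log10(smallest normal double)             = "
              << std::log10(std::numeric_limits<double>::min()) << '\n';
    std::cout << "log10(smallest normal 80-bit long double) = "
              << std::log10(std::numeric_limits<long double>::min()) << std::endl;
}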

+8

I am running g++ (Ubuntu 4.4.3-4ubuntu5) 4.4.3 on 64-bit Linux. I used -O3 just like you.

Here are my timing results for the tests:

  • Case 2: 6.81 s
  • Case 3: 0.00 s
  • Case 4: 0.20 s

I looked at the assembly of all three tests.

Case 3 is fast because GCC optimizes away the entire for loop. The main function is simply (after removing labels and .cfi* directives):

main:
    xorl %eax, %eax
    ret

The only difference in the assembly between case 2 and case 4 is the constant, which presumably represents 0.999 or 0.999999:

$ diff case2.s case4.s
121,122c121,122
< .long 3642132267
< .long 1072691150
---
> .long 3877702633
> .long 1072693245

This is the only difference in the assembly: GCC performed the same set of optimizations for case 2 and case 4, yet case 2 takes roughly 30 times as long as case 4. My conclusion is that floating-point multiplication on the x64 architecture must take a variable amount of time depending on the values being multiplied. That shouldn't be a terribly surprising claim, but it would be nice if someone who knows more about x64 could explain why it is true.
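As a sanity check (my own sketch, not part of the original answer), the two .long pairs in the diff can be reassembled into the doubles they encode; gcc emits the low word first on little-endian x86:

#include <cstdint>
#include <cstring>
#include <iostream>

static double fromWords(uint32_t lo, uint32_t hi)
{
    uint64_t bits = (static_cast<uint64_t>(hi) << 32) | lo;
    double d;
    std::memcpy(&d, &bits, sizeof d);   // reinterpret the bit pattern as a double
    return d;
}

int main()
{
    std::cout.precision(17);
    std::cout << fromWords(3642132267u, 1072691150u) << '\n';  // expect ~0.999
    std::cout << fromWords(3877702633u, 1072693245u) << '\n';  // expect ~0.999999
}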

I have not looked closely at case 1, because it only does 1e7 iterations instead of 1e8, so of course it takes less time.

+1

For what it's worth, I ran a few tests on all four versions, with the following results:

  • First, I see the same speed discrepancies you do, more or less: my machine is slower, and I see distinct differences between tests 1, 3 and 4, but they all remain two orders of magnitude faster than test 2.

  • Test 3 was the fastest: looking at the generated code shows that g++ removed the loop entirely because its results were not used.

  • Test 1 took a little more than a tenth of the time of test 4, which is roughly what one would expect.

  • And the only difference in the generated code between tests 1, 2 and 4 is the constant; there is no difference in the loop itself.

Thus, the difference is not due to compiler optimization. I can only suppose that it has something to do with underflow handling. (But then I would expect a slowdown in loop 1 as well.)
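One way to test that hypothesis directly would be something like the following sketch (my own, not measured here), which times the same number of multiplies on a value that stays normalized versus one that starts out subnormal:

#include <ctime>
#include <iostream>
#include <limits>

static double run(double start, double factor, long iters)
{
    volatile double v = start;          // volatile keeps the loop from being optimized away
    for (long i = 0; i < iters; ++i)
        v = v * factor;
    return v;
}

int main()
{
    const long iters = 10000000;        // 1e7 multiplies per run

    for (int pass = 0; pass < 2; ++pass) {
        // pass 0: start at 1.0 (stays normalized); pass 1: start at the smallest denormal
        double start = (pass == 0) ? 1.0 : std::numeric_limits<double>::denorm_min();
        std::clock_t t0 = std::clock();
        run(start, 0.999999, iters);    // 0.999999 keeps a normalized start normalized
        std::clock_t t1 = std::clock();
        std::cout << (pass == 0 ? "normalized start: " : "subnormal start:  ")
                  << double(t1 - t0) / CLOCKS_PER_SEC << " s\n";
    }
}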

+1
