C compilers and loop optimization

I do not have much experience with how compiler optimizers work or what the difference is between the different optimization levels (-O2 vs. -O3 for gcc, for example). So I'm not sure whether the following two snippets are equivalent for an arbitrary compiler:

    for (i = 0; i < 10; ++i) {
        variable1 * variable2 * gridpoint[i];
    }

and

    variable3 = variable1 * variable2;
    for (i = 0; i < 10; ++i) {
        variable3 * gridpoint[i];
    }

From a processing-time point of view, it would seem advisable to calculate the product of variable1 and variable2 only once, since they do not change inside the loop. This requires additional memory, but I'm not sure how much the optimizer mitigates that overhead. The first form is easiest to read if you have an equation from a paper or book and want to translate it into code, but the second may be the fastest, especially for more complex equations with a large number of loop-invariant terms (I have some rather unpleasant nonlinear differential equations that I would like to keep human-readable in the code). Does any of this change if I declare my variables as const? I hope my question makes sense for an arbitrary compiler, since I use the gcc, Intel, and Portland compilers.

+7
3 answers

It is difficult to answer this question adequately for an arbitrary compiler. What can be done with this code depends not only on the compiler, but also on the target architecture. I will try to explain what a good production compiler could do with this code.

From a processing-time point of view, it would seem advisable to calculate the product of variable1 and variable2 only once, since they do not change inside the loop.

You're right. And, as Mr. Kat noted, this is called common subexpression elimination. The compiler can generate code that evaluates the expression only once (or even evaluates it at compile time, if both operands are known to be constants).
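
As a minimal sketch of what this transformation does (the names here are made up for illustration):

    /* A minimal sketch of common subexpression elimination. */
    double before_cse(double a, double b, double c, double *y)
    {
        *y = a * b - c;       /* a * b is written twice ...            */
        return a * b + c;
    }

    double after_cse(double a, double b, double c, double *y)
    {
        double t = a * b;     /* ... so it is evaluated only once, and */
        *y = t - c;           /* kept in a register for both uses      */
        return t + c;
    }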

A decent compiler can also perform common subexpression elimination on function calls, if it can determine that the functions have no side effects. For example, GCC can analyze a function if its body is available, but there are also the pure and const attributes that can be used to explicitly mark functions that are eligible for this optimization (see Function Attributes in the GCC manual).
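
For instance, a sketch using GCC's const attribute (the function names are made up for illustration); a call whose result depends only on its arguments can be hoisted out of the loop:

    /* scale() depends only on its arguments and touches no global state,
       which __attribute__((const)) promises to the compiler. */
    __attribute__((const))
    static double scale(double a, double b)
    {
        return a * b;
    }

    double weighted_sum(const double *grid, int n, double a, double b)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += scale(a, b) * grid[i];  /* the call can be hoisted out */
        return s;
    }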

Given that there are no side effects and the compiler can determine this (there is nothing tricky in your example), the two fragments are equivalent in this respect (I checked with clang :-)).
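
If you want to verify this yourself, you can ask the compiler to print the generated assembly and compare the two versions (the file name here is illustrative):

    $ gcc -O2 -S -o - loop.c     # print the generated assembly to stdout
    $ clang -O2 -S -o - loop.c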

This requires additional memory, but I'm not sure how much the optimizer mitigates that overhead.

In fact, this does not require additional memory. The multiplication is performed in processor registers, and the result is stored in a register as well. It is about eliminating redundant computation and using a single register to hold the result, which of course also makes life easier when it comes to register allocation, especially inside a loop. So if this optimization can be performed, it will be done at no extra cost.

The first form is easiest to read.

Both GCC and Clang will perform this optimization. I'm not sure about other compilers, though, so you will have to check for yourself. But it is hard to imagine any good compiler that does not perform common subexpression elimination.

Does any of this change if I declare my variables as const?

It can. This is related to constant expressions: expressions that contain only constants. A constant expression can be evaluated at compile time rather than at run time. So, for example, if you multiply A, B, and C, where A and B are both constants, the compiler will precompute the expression A*B and only multiply C by this precomputed value. Compilers can also do this even with non-constant variables, if they can determine the value at compile time and make sure it is not modified. For example:

    $ cat test.c
    inline int foo(int a, int b)
    {
        return a * b;
    }

    int main()
    {
        int a;
        int b;
        a = 1;
        b = 2;
        return foo(a, b);
    }
    $ clang -Wall -pedantic -O4 -o test ./test.c
    $ otool -tv ./test
    ./test:
    (__TEXT,__text) section
    _main:
    0000000100000f70    movl    $0x00000002,%eax
    0000000100000f75    ret

There are other optimizations that may apply to the fragments above. Here are a few that come to mind:

The first and most obvious one is loop unrolling. Since the number of iterations is known at compile time, the compiler may decide to unroll the loop. Whether this optimization is applied depends on the architecture (some processors can run a small loop out of an internal loop buffer and execute it faster than the unrolled version; the compact loop is also friendlier to the instruction cache, takes less space, avoids extra µop decode work, etc.).
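
As a sketch of what unrolling means, assuming the products are actually stored somewhere (here a hypothetical result array; the compiler performs this rewrite internally):

    void scaled_copy(double result[10], const double gridpoint[10],
                     double variable3)
    {
        int i;
        /* 2-way unrolled: half as many compares, branches, and
           loop-counter updates as the original 10-iteration loop. */
        for (i = 0; i < 10; i += 2) {
            result[i]     = variable3 * gridpoint[i];
            result[i + 1] = variable3 * gridpoint[i + 1];
        }
    }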

The second optimization, which can literally speed things up by as much as 50 times, is the use of SIMD instructions (SSE, AVX, etc.). GCC, for example, is very good at this (Intel should be too, if not better). I verified that the following function:

    uint8_t dumb_checksum(const uint8_t *p, size_t size)
    {
        uint8_t s = 0;
        size_t i;
        for (i = 0; i < size; ++i)
            s = (uint8_t)(s + p[i]);
        return s;
    }

... is converted into a loop where each step sums 16 values at once (i.e., with something like _mm_add_epi8), with extra code to handle alignment and a leftover (< 16) number of iterations. Clang, however, completely failed at this the last time I checked. So GCC can vectorize your loop this way even when the number of iterations is not known at compile time.
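
For illustration, here is a hand-written sketch of roughly what that vectorized loop does, using SSE2 intrinsics. This is an assumption about the shape of the generated code, not the compiler's exact output (the real code also chooses between aligned and unaligned paths):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    uint8_t dumb_checksum_sse2(const uint8_t *p, size_t size)
    {
        __m128i acc = _mm_setzero_si128();
        size_t i = 0;
        for (; i + 16 <= size; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
            acc = _mm_add_epi8(acc, v);          /* 16 additions at once */
        }
        /* Fold the 16 lanes down to one byte; byte addition wraps
           mod 256, so the summation order does not matter. */
        uint8_t lanes[16];
        _mm_storeu_si128((__m128i *)lanes, acc);
        uint8_t s = 0;
        for (int k = 0; k < 16; ++k)
            s = (uint8_t)(s + lanes[k]);
        for (; i < size; ++i)                    /* scalar tail (< 16) */
            s = (uint8_t)(s + p[i]);
        return s;
    }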

And if I may offer a suggestion: do not optimize your code unless you have found it to be a bottleneck. Otherwise, you may spend a lot of time on misguided and premature optimization.

Hope this answers your questions. Good luck!

+4

Yes, you can count on compilers to do a good job of common subexpression elimination, even across loops. This can lead to a slight increase in memory usage, but any worthy compiler will take that into account, and eliminating the subexpression is almost always a win (since the memory in question is registers and L1 cache).

Here are some quick tests to prove it to yourself. The results show that you should generally not try to outsmart the compiler by manually eliminating subexpressions; just write the code naturally and let the compiler do what it is good at (such as figuring out which expressions really should be eliminated and which should not, given the target architecture and the surrounding code).

Later, if you are not satisfied with the performance of your code, you should run a profiler over it and see which statements and expressions eat the most time, and then try to figure out whether you can reorganize the code to help the compiler. But I would say that in most cases it will not be simple things like this; it will be things like organizing your data better to reduce stalls, eliminating redundant interprocedural computations, and the like.
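
As a minimal sketch of that workflow with gprof (the file name is illustrative):

    $ gcc -O2 -pg prog.c -o prog    # -pg adds profiling instrumentation
    $ ./prog                        # running it writes gmon.out
    $ gprof ./prog gmon.out         # per-function time breakdown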

(FTR: the calls to random() in the following code simply ensure that the compiler cannot be too aggressive about eliminating variables and unrolling the loops.)

prog1:

    #include <stdlib.h>
    #include <time.h>

    int main()
    {
        srandom(time(NULL));
        int i, ret = 0, a = random(), b = random(), values[10];
        int loop_end = random() % 5 + 1000000000;
        for (i = 0; i < 10; ++i) {
            values[i] = random();
        }
        for (i = 0; i < loop_end; ++i) {
            ret += a * b * values[i % 10];
        }
        return ret;
    }

prog2:

    #include <stdlib.h>
    #include <time.h>

    int main()
    {
        srandom(time(NULL));
        int i, ret = 0, a = random(), b = random(), values[10];
        int loop_end = random() % 5 + 1000000000;
        for (i = 0; i < 10; ++i) {
            values[i] = random();
        }
        int c = a * b;
        for (i = 0; i < loop_end; ++i) {
            ret += c * values[i % 10];
        }
        return ret;
    }

And here are the results:

    > gcc -O2 prog1.c -o prog1; time ./prog1
    ./prog1  1.62s user 0.00s system 99% cpu 1.630 total
    > gcc -O2 prog2.c -o prog2; time ./prog2
    ./prog2  1.63s user 0.00s system 99% cpu 1.636 total

(This is wall-clock time, so don't read anything into the 0.01 s difference; over several runs they both fall in the 1.62-1.63 s range, so they run at the same speed.)

Interestingly, prog1 was faster when compiled without optimization:

    > gcc -O0 prog1.c -o prog1; time ./prog1
    ./prog1  2.83s user 0.00s system 99% cpu 2.846 total
    > gcc -O0 prog2.c -o prog2; time ./prog2
    ./prog2  2.93s user 0.00s system 99% cpu 2.946 total

Also interestingly, compiling with -O1 gave the best performance of all:

    > gcc -O1 prog1.c -o prog1; time ./prog1
    ./prog1  1.57s user 0.00s system 99% cpu 1.579 total
    > gcc -O1 prog2.c -o prog2; time ./prog2
    ./prog2  1.56s user 0.00s system 99% cpu 1.563 total

GCC and Intel are great compilers and are pretty smart about things like this. I have no experience with the Portland compiler, but this is pretty basic stuff for an optimizer, so I would be very surprised if it could not handle these situations.

+3

If I were a compiler, I would notice that both of these loops compute values that are never assigned to anything and have no side effects at all (other than leaving i set to 10), so I would just optimize the loops away entirely.

I am not saying that this actually happens; it just looks like it could, given the code you provided.
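
A minimal sketch of what that would mean in practice (the function wrapper is made up for illustration; a compiler may warn that the expression result is unused):

    /* The loop body computes a value and throws it away, so an optimizer
       that proves there are no side effects can delete the whole loop. */
    void f(const double *gridpoint, double variable1, double variable2)
    {
        int i;
        for (i = 0; i < 10; ++i) {
            variable1 * variable2 * gridpoint[i];  /* result unused */
        }
        /* At -O2, gcc and clang typically compile f() to a bare return. */
    }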

0
