Loop unrolling in inlined functions in C

Question

I have a question about C compiler optimization and when/how loops inside inlined functions get unrolled.

I am developing a numerical code that does something like the example below. Basically, my_for() would compute some kind of stencil and call op() to do something with the data in my_type *arg for each i. Here, my_func() wraps my_for(), creating the argument and passing a function pointer to my_op(), whose job is to modify the i-th double of each of the (arg->n) double arrays arg->dest[j].

    typedef struct my_type {
        int const n;
        double *dest[16];
        double const *src[16];
    } my_type;

    static inline void my_for( void (*op)(my_type *,int), my_type *arg, int N ) {
        int i;
        for( i=0; i<N; ++i )
            op( arg, i );
    }

    static inline void my_op( my_type *arg, int i ) {
        int j;
        int const n = arg->n;
        for( j=0; j<n; ++j )
            arg->dest[j][i] += arg->src[j][i];
    }

    void my_func( double *dest0, double *dest1, double const *src0, double const *src1, int N ) {
        my_type Arg = {
            .n = 2,
            .dest = { dest0, dest1 },
            .src = { src0, src1 }
        };
        my_for( &my_op, &Arg, N );
    }

This works great. The functions are inlined as they should be, and the code is (almost) as efficient as if I had written everything inline in a single function and unrolled the j loop, without any my_type Arg at all.

Here is the confusing part: if I set int const n = 2; instead of int const n = arg->n; in my_op(), the code becomes as fast as the unrolled single-function version. So the question is: why? If everything is inlined into my_func(), why doesn't the compiler see that I am literally defining Arg.n = 2? Furthermore, there is no improvement when I explicitly use arg->n as the bound of the j loop, which should look just like the faster int const n = 2; after inlining. I also tried using my_type const everywhere to really signal this constness to the compiler, but it just doesn't want to unroll the loop.

In my numerical code, this amounts to a performance hit of about 15%. If it matters, n=4 there, and these j loops appear in a couple of conditional branches inside an op().

I am compiling with icc (ICC) 12.1.5 20120612. I tried #pragma unroll. Here are my compiler options (did I miss any good ones?):
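For reference, the two variants being compared can be written side by side as in the sketch below. The names my_op_runtime_n and my_op_fixed_n are made up here for illustration; only the line that initializes n differs, and the fixed version is only correct when arg->n really is 2.

```c
/* Same layout as my_type in the question. */
typedef struct my_type {
    int const n;
    double *dest[16];
    double const *src[16];
} my_type;

/* Hypothetical name: trip count read through the pointer
   (the variant reported as slower). */
static inline void my_op_runtime_n(my_type *arg, int i)
{
    int const n = arg->n;
    for (int j = 0; j < n; ++j)
        arg->dest[j][i] += arg->src[j][i];
}

/* Hypothetical name: trip count hard-coded
   (the variant reported as fast). Assumes arg->n == 2. */
static inline void my_op_fixed_n(my_type *arg, int i)
{
    int const n = 2;
    for (int j = 0; j < n; ++j)
        arg->dest[j][i] += arg->src[j][i];
}
```

Both functions produce the same result when n is 2; whether the first one gets unrolled after inlining depends on the compiler and its version.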

-O3 -ipo -static -unroll-aggressive -fp-model precise -fp-model source -openmp -std=gnu99 -Wall -Wextra -Wno-unused -Winline -pedantic

Thanks!

+8

optimization c inline icc

Finiteelement Jun 12 '15 at 11:32

2 answers

It is faster because your program does not allocate memory for the variable.

If you do not have to perform any operations on unknown values, they are treated as if they were #define constant 2, but with type checking. They are simply substituted in at compile time.

Could you pick one of the two tags (I mean C or C++)? It is confusing, because the languages treat const values differently: C treats them like ordinary variables whose value simply cannot be changed, while in C++ they may or may not have memory allocated depending on the context (if you need their address, or if they have to be computed while the program is running, then memory is allocated).

Source: Thinking in C++. No exact quote.

+2

Adrian jałoszewski Jun 12 '15 at 11:47

egur · Accepted Answer · 2015-06-12T12:05:08+0000

Well, obviously the compiler is not "smart" enough to propagate the constant n and unroll the for loop. Actually, it is playing it safe, since arg->n can change between instantiation and use.

To get consistent performance across compiler generations and squeeze the most out of your code, do the unrolling by hand.

What people like me do in these situations (where performance is king) is rely on macros.

Macros will still be "inlined" in debug builds (useful) and can be templated (to a point) using macro parameters. Macro parameters that are compile-time constants are guaranteed to stay that way.
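As a minimal sketch of what unrolling by hand could look like for the n = 2 case from the question: the function name add2_unrolled and its flat argument list are assumptions made for illustration, not code from the answer.

```c
/* Hypothetical hand-unrolled kernel for exactly two array pairs.
   The j loop from my_op() is written out as two statements. */
static inline void add2_unrolled(double *dest0, double *dest1,
                                 double const *src0, double const *src1,
                                 int N)
{
    for (int i = 0; i < N; ++i) {
        dest0[i] += src0[i];   /* j = 0 */
        dest1[i] += src1[i];   /* j = 1 */
    }
}
```

The trade-off is that each array count needs its own hand-written kernel.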
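One way the macro approach might look is sketched below. DEFINE_ADD_KERNEL, add_kernel_2 and add_kernel_4 are hypothetical names, and the kernel body is borrowed from my_op() in the question; the answer itself does not show code. Because N_ARRAYS is a literal at each expansion, the inner loop has a trip count the compiler can see as constant.

```c
/* Hypothetical macro: stamps out one kernel per array count.
   N_ARRAYS must be a compile-time constant at the expansion site. */
#define DEFINE_ADD_KERNEL(NAME, N_ARRAYS)                                  \
    static inline void NAME(double *dest[], double const *src[], int len)  \
    {                                                                      \
        for (int i = 0; i < len; ++i)                                      \
            for (int j = 0; j < (N_ARRAYS); ++j)                           \
                dest[j][i] += src[j][i];                                   \
    }

/* One specialized kernel per count that is actually needed. */
DEFINE_ADD_KERNEL(add_kernel_2, 2)
DEFINE_ADD_KERNEL(add_kernel_4, 4)
```

A caller would pick add_kernel_2 or add_kernel_4 explicitly. Whether the inner loop is fully unrolled is still up to the optimizer, but the bound no longer has to be recovered from a struct field.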
