I have a question about C compiler optimization and when/how loops inside inlined functions are unrolled.
I am developing a numerical code that does something like the example below. Basically, my_for() would compute some kind of stencil and call op() to do something with the data in my_type *arg for each i. Here, my_func() wraps my_for(), creating the argument and passing a function pointer to my_op(), whose job is to modify the i-th double in each of the (arg->n) double arrays arg->dest[j].
    typedef struct my_type
    {
        int const n;
        double *dest[16];
        double const *src[16];
    } my_type;

    static inline void my_for( void (*op)(my_type *,int), my_type *arg, int N )
    {
        int i;

        for( i=0; i<N; ++i )
            op( arg, i );
    }

    static inline void my_op( my_type *arg, int i )
    {
        int j;
        int const n = arg->n;

        for( j=0; j<n; ++j )
            arg->dest[j][i] += arg->src[j][i];
    }

    void my_func( double *dest0, double *dest1, double const *src0, double const *src1, int N )
    {
        my_type Arg =
        {
            .n = 2,
            .dest = { dest0, dest1 },
            .src = { src0, src1 }
        };

        my_for( &my_op, &Arg, N );
    }
It works great. The functions are inlined as they should be, and the code is (almost) as efficient as if I had written everything inline in a single function and unrolled the j loop by hand, without any my_type Arg at all.
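To make the comparison concrete, here is a minimal sketch of what such a single-function version could look like for n = 2. The name my_func_unrolled is made up for this post, and the real code does more work per element, so treat it as an illustration only.

```c
/* Illustrative reference only: one function, no my_type Arg,
   and the j loop written out by hand for n = 2.
   my_func_unrolled is a hypothetical name used just for this example. */
void my_func_unrolled( double *dest0, double *dest1,
                       double const *src0, double const *src1, int N )
{
    int i;

    for( i=0; i<N; ++i )
    {
        dest0[i] += src0[i];   /* j = 0 */
        dest1[i] += src1[i];   /* j = 1 */
    }
}
```

It takes the same arguments as my_func(), so the two can be swapped in a timing test.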
Here's the confusion: if I set int const n = 2; rather than int const n = arg->n; in my_op(), the code becomes as fast as the unrolled single-function version. So the question is: why? If everything is being inlined into my_func(), why doesn't the compiler see that I am literally defining Arg.n = 2? Furthermore, there is no improvement when I explicitly make arg->n the bound of the j loop, which should look just like the faster int const n = 2; after inlining. I also tried using my_type const everywhere to really signal this const-ness to the compiler, but it just doesn't want to unroll the loop.
In my numerical code, this amounts to about a 15% performance hit. If it matters, n = 4 there, and these j loops appear in several conditional branches inside op().
I am compiling with icc (ICC) 12.1.5 20120612. I tried #pragma unroll. Here are my compiler options (did I miss any good ones?):
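For clarity, this is a self-contained sketch of the faster variant I am comparing against. The only change from the example above is the hard-coded trip count in the op; the _fixed names are made up for this post.

```c
typedef struct my_type
{
    int const n;
    double *dest[16];
    double const *src[16];
} my_type;

static inline void my_for( void (*op)(my_type *,int), my_type *arg, int N )
{
    int i;

    for( i=0; i<N; ++i )
        op( arg, i );
}

/* Same as my_op(), except that the loop bound is a literal constant.
   my_op_fixed is a hypothetical name used only for this comparison. */
static inline void my_op_fixed( my_type *arg, int i )
{
    int j;
    int const n = 2;   /* was: int const n = arg->n; */

    for( j=0; j<n; ++j )
        arg->dest[j][i] += arg->src[j][i];
}

/* Same wrapper as my_func(), but passing the fixed-bound op. */
void my_func_fixed( double *dest0, double *dest1,
                    double const *src0, double const *src1, int N )
{
    my_type Arg =
    {
        .n = 2,
        .dest = { dest0, dest1 },
        .src = { src0, src1 }
    };

    my_for( &my_op_fixed, &Arg, N );
}
```

Both variants compute the same result for Arg.n = 2; only the form of the loop bound seen by the compiler differs.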
-O3 -ipo -static -unroll-aggressive -fp-model precise -fp-model source -openmp -std=gnu99 -Wall -Wextra -Wno-unused -Winline -pedantic
Thanks!