Which gcc option enables loop unrolling for SSE intrinsics with immediate operands?

Question

This question relates to gcc (4.6.3, Ubuntu) and how it unrolls loops that contain SSE intrinsics with immediate operands.

An example of an intrinsic with an immediate operand is _mm_blend_ps. It expects a 4-bit immediate integer, which can only be a constant. However, with the -O3 option, the compiler seems to unroll loops automatically (if the values of the loop counter can be determined at compile time) and generates several instances of the corresponding blend instruction with different immediate values.

This is a simple test program (blendsimple.c) that loops over all 16 possible values of the blend's immediate operand:

 #include <stdio.h>
 #include <x86intrin.h>

 #define PRINT(V) \
   printf("%s: ", #V); \
   for (i = 3; i >= 0; i--) printf("%3g ", V[i]); \
   printf("\n");

 int main()
 {
   __m128 a = _mm_set_ps(1, 2, 3, 4);
   __m128 b = _mm_set_ps(5, 6, 7, 8);
   int i;
   PRINT(a);
   PRINT(b);
   unsigned mask;
   __m128 r;
   for (mask = 0; mask < 16; mask++) {
     r = _mm_blend_ps(a, b, mask);
     PRINT(r);
   }
   return 0;
 }

You can compile this code with

 gcc -Wall -march=native -O3 -o blendsimple blendsimple.c

and the code works. Apparently, the compiler unrolls the loop and inserts constants for the immediate operand.

However, if you compile the code with

 gcc -Wall -march=native -O2 -o blendsimple blendsimple.c

you get the following error for the blend intrinsic:

 error: the last argument must be a 4-bit immediate

I tried to find out which specific compiler flag is active in -O3 but not in -O2 and allows the compiler to unroll the loop, but I failed. Following the gcc online docs at

https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/Overall-Options.html

I executed the following commands:

 gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
 gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
 diff /tmp/O2-opts /tmp/O3-opts | grep enabled

which lists all options enabled by -O3 but not by -O2. When I add all 7 listed flags to -O2

 gcc -Wall -march=native -O2 -fgcse-after-reload -finline-functions -fipa-cp-clone -fpredictive-commoning -ftree-loop-distribute-patterns -ftree-vectorize -funswitch-loops -o blendsimple blendsimple.c

I would expect the behavior to be exactly the same as with -O3. However, the compiler still complains that "the last argument must be a 4-bit immediate".

Does anyone have an idea what the problem is? It would be good to know which flag is required to enable this kind of loop unrolling, so that it can be activated selectively using #pragma GCC optimize or a function attribute.

(I was also surprised that -O3 apparently does not even enable the -funroll-loops option.)

I would be grateful for any help. This is for the SSE programming lecture that I give.

Edit: Thank you very much for your comments. jtaylor seems to be right. I got my hands on two newer versions of gcc (4.7.3 and 4.8.2), and 4.8.2 complains about the immediate operand regardless of the optimization level. Moreover, I later noticed that gcc 4.6.3 compiles the code with -O2 -funroll-loops, but this does not work in 4.8.2 either. So apparently this behavior cannot be relied upon, and the loop should always be unrolled "manually" using cpp or templates, as pointed out by Jason R.
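The "manual" unrolling mentioned in the edit could look roughly like the sketch below in plain C. It is not taken from the comments: blend_ps_rt and BLEND_CASE are hypothetical names, and the target attribute assumes a reasonably recent GCC or Clang (on older compilers, drop it and pass -msse4.1 or -march=native instead). The idea is that every call to _mm_blend_ps receives a literal constant, so nothing depends on the optimizer unrolling a loop.

```c
#include <assert.h>
#include <smmintrin.h>  /* SSE4.1: _mm_blend_ps */

/* Hypothetical helper: maps a run-time mask (0..15) onto 16 calls that
   each pass a literal constant as the immediate operand.
   Assumption: the target attribute is supported by the compiler in use. */
__attribute__((target("sse4.1")))
static __m128 blend_ps_rt(__m128 a, __m128 b, unsigned mask)
{
  assert(mask < 16);  /* the immediate is only 4 bits wide */

/* The preprocessor writes out the "unrolled" cases for us. */
#define BLEND_CASE(M) case M: return _mm_blend_ps(a, b, M)
  switch (mask) {
    BLEND_CASE(0);  BLEND_CASE(1);  BLEND_CASE(2);  BLEND_CASE(3);
    BLEND_CASE(4);  BLEND_CASE(5);  BLEND_CASE(6);  BLEND_CASE(7);
    BLEND_CASE(8);  BLEND_CASE(9);  BLEND_CASE(10); BLEND_CASE(11);
    BLEND_CASE(12); BLEND_CASE(13); BLEND_CASE(14); BLEND_CASE(15);
    default: break;
  }
#undef BLEND_CASE
  return a;  /* only reached for an out-of-range mask when NDEBUG is set */
}
```

In the test program, the loop body would then call blend_ps_rt(a, b, mask) instead of _mm_blend_ps(a, b, mask). The switch costs a branch per call, which is acceptable for a demonstration but probably not for a hot loop.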

c gcc sse

Raalf Jul 18 '14 at 11:22

1 answer

pAndrei · Answer 1 · 2015-01-16T11:44:59+0000

I'm not sure if this applies to your situation, as I am not familiar with SSE intrinsics. But in general, you can tell the compiler to apply a specific optimization to a section of code with:

 #pragma GCC push_options
 #pragma GCC optimize ("unroll-loops")

 do your stuff

 #pragma GCC pop_options

Source: Tell gcc to specifically unroll a loop.
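The function-attribute form that the question mentions can be sketched in a similar way. This is only an illustration on an ordinary loop: sum_lanes and UNROLL_HINT are hypothetical names, the optimize attribute is GCC-specific (the macro expands to nothing elsewhere), and, as the question's edit reports, unrolling requested this way did not make the immediate-operand error go away on gcc 4.8.2.

```c
#include <assert.h>
#include <stddef.h>

/* GCC-specific: request -funroll-loops for this one function only.
   Other compilers get an empty macro and compile the loop normally. */
#if defined(__GNUC__) && !defined(__clang__)
#define UNROLL_HINT __attribute__((optimize("unroll-loops")))
#else
#define UNROLL_HINT
#endif

/* Hypothetical example function: adds up n floats. */
UNROLL_HINT
static float sum_lanes(const float *v, size_t n)
{
  float s = 0.0f;
  size_t i;

  assert(v != NULL || n == 0);
  for (i = 0; i < n; i++)
    s += v[i];
  return s;
}
```

The attribute is a request to the optimizer, not a guarantee, so code that needs a compile-time constant should not depend on it.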
