I'm trying to learn about vectorization by studying simple C code compiled with gcc at -O3, in particular how well the compiler vectorizes. This is a personal first step toward being able to check gcc -O3 performance on more complex computations. I understand that conventional wisdom says compilers are better than humans at this, but I never take such wisdom for granted.
In my first simple test, however, I find some of the choices gcc makes quite strange and, frankly, grossly sloppy in terms of optimization. I'm willing to assume that the compiler knows something about the target CPU (an Intel i5-2557M in this case) that I don't, but I need some confirmation from knowledgeable people.
My simple test code (segment):
    int i;
    float a[100];

    for (i = 0; i < 100; i++)
        a[i] = (float) i * i;
The resulting assembly code (segment) corresponding to the for loop is as follows:
    .L6:                                        ; loop starts here
        movdqa    xmm0, xmm1                    ; copy packed integers in xmm1 to xmm0
    .L3:
        movdqa    xmm1, xmm0                    ; wait, what!? WHY!? this is redundant.
        cvtdq2ps  xmm0, xmm0                    ; convert integers to float
        add       rax, 16                       ; increment memory pointer for next iteration
        mulps     xmm0, xmm0                    ; packed square of all floats in xmm0
        paddd     xmm1, xmm2                    ; packed increment of all integers by 4
        movaps    XMMWORD PTR [rax-16], xmm0    ; store result
        cmp       rax, rdx                      ; test loop termination
        jne       .L6
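To check my reading of the loop, here is a rough hand translation into SSE2 intrinsics next to the scalar original. It is only a sketch: the function names, the parameter n, and the starting vectors {0, 1, 2, 3} and {4, 4, 4, 4} are my own assumptions about the setup code that is not shown, and I use an unaligned store where gcc emitted movaps so that any float pointer works. Intrinsics do not let me pick registers, so this states the data flow only, not the register moves.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar reference: the loop from my test, wrapped in a function.
   The parameter n is my addition; the original uses a fixed 100. */
void fill_squares_scalar(float *a, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = (float) i * i;   /* the cast applies to i first, so the multiply is done in float */
}

/* Hand translation of the vectorized loop.
   Assumes n is a multiple of 4 (true for n = 100). */
void fill_squares_sse2(float *a, int n)
{
    __m128i idx = _mm_setr_epi32(0, 1, 2, 3);   /* running indices (assumed initial value) */
    const __m128i step = _mm_set1_epi32(4);     /* what I assume xmm2 holds */
    int i;

    for (i = 0; i < n; i += 4) {
        __m128 f = _mm_cvtepi32_ps(idx);        /* cvtdq2ps: four ints to four floats */
        f = _mm_mul_ps(f, f);                   /* mulps: square each lane */
        _mm_storeu_ps(a + i, f);                /* store four results (gcc used aligned movaps) */
        idx = _mm_add_epi32(idx, step);         /* paddd: advance each index by 4 */
    }
}
```

Calling both functions with n = 100 and comparing the arrays element by element should show identical contents, since every square up to 99 * 99 = 9801 is exactly representable in a float.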
I understand all the steps, and computationally the whole thing makes sense. What I don't understand is why gcc chose to include in the loop a step that loads xmm1 from xmm0 right after xmm0 was loaded from xmm1. That is:
    .L6:
        movdqa    xmm0, xmm1    ; loop starts here
    .L3:
        movdqa    xmm1, xmm0    ; grrr!
This alone makes me question the optimizer's common sense. Obviously, the extra MOVDQA does not disturb the data, but at face value it looks grossly sloppy on the part of gcc.
Earlier in the assembly code (not shown), xmm0 and xmm2 are initialized with values meaningful for the vectorization, so on loop entry the code obviously has to skip the first MOVDQA. But why doesn't gcc simply rearrange the code, as shown below?
    .L3:
        movdqa    xmm1, xmm0    ; initialize xmm1 PRIOR to loop
    .L6:
        movdqa    xmm0, xmm1    ; loop starts here
Or even better, simply initialize xmm1 instead of xmm0 and drop the MOVDQA xmm1, xmm0 step altogether!
I am prepared to believe that the CPU is smart enough to skip the extra step or something like that, but how can I trust gcc to fully optimize complex code if it can't even get this simple code right? Or can someone give a sound explanation that would give me faith that gcc -O3 is good stuff?