Redundant assembly code in optimized C code

I'm trying to learn about vectorization by studying simple C code compiled with gcc at -O3, in particular how well compilers vectorize. This is a personal journey toward being able to check gcc -O3 performance on more complex computations. I understand the conventional wisdom is that compilers are better than humans, but I never take such wisdom for granted.

In my first simple test, however, I find some of the choices gcc makes quite strange and, frankly, grossly sloppy in terms of optimization. I'm willing to assume that the compiler is being purposeful and knows something about the CPU (Intel i5-2557M in this case) that I do not. But I need some confirmation from knowledgeable people.

My simple test code (segment):

    int i;
    float a[100];
    for (i = 0; i < 100; i++)
        a[i] = (float) i*i;

The resulting assembly code (segment) corresponding to the for loop is as follows:

    .L6:                                        ; loop starts here
        movdqa   xmm0, xmm1                     ; copy packed integers in xmm1 to xmm0
    .L3:
        movdqa   xmm1, xmm0                     ; wait, what!? WHY!? this is redundant.
        cvtdq2ps xmm0, xmm0                     ; convert integers to float
        add      rax, 16                        ; increment memory pointer for next iteration
        mulps    xmm0, xmm0                     ; pack square all integers in xmm0
        paddd    xmm1, xmm2                     ; pack increment all integers by 4
        movaps   XMMWORD PTR [rax-16], xmm0     ; store result
        cmp      rax, rdx                       ; test loop termination
        jne      .L6

I understand all the steps, and computationally all of this makes sense. What I do not understand is why gcc chose to include in the iterative loop a step that loads xmm1 from xmm0 immediately after xmm0 was loaded from xmm1, i.e.

    .L6:
        movdqa   xmm0, xmm1     ; loop starts here
    .L3:
        movdqa   xmm1, xmm0     ; grrr!

This alone makes me question the optimizer's common sense. Obviously the extra MOVDQA does not disturb the data, but at face value it looks grossly sloppy on the part of gcc.

Earlier in the assembly code (not shown), xmm0 and xmm2 are initialized with values meaningful for the vectorization, so obviously the code has to skip the first MOVDQA when it enters the loop. But why doesn't gcc simply rearrange the code, as shown below?

    .L3:
        movdqa   xmm1, xmm0     ; initialize xmm1 PRIOR to loop
    .L6:
        movdqa   xmm0, xmm1     ; loop starts here

Or even better, simply initialize xmm1 instead of xmm0 and drop the MOVDQA xmm1, xmm0 step altogether.

I am prepared to believe that the CPU is smart enough to skip the redundant step or something like that, but how can I trust gcc to fully optimize complex code if it cannot even get this simple code right? Or can someone provide a sound explanation that would give me faith that gcc -O3 is good stuff?

optimization c assembly gcc
1 answer

I'm not 100% sure, but it looks like your loop destroys xmm0 by converting it to float in place, so the compiler keeps the integer values in xmm1 and then copies them over to another register (in this case xmm0).
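To make the register roles easier to follow, here is a minimal sketch of the same vectorized loop written with SSE2 intrinsics. It is not what gcc generates internally; the function name `fill_squares_sse2` and the restriction that `n` be a multiple of 4 are assumptions made to keep the example short. The integer lanes live in one variable (`idx`) and the converted floats in another (`f`), which is the reason the integers have to be preserved across the in-place `cvtdq2ps`.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics (x86 / x86-64 only) */

/* Hypothetical helper: fills a[0..n-1] with (float)i * i, four lanes at a time.
   Assumes n is a multiple of 4; a real version would handle the remainder. */
static void fill_squares_sse2(float *a, int n)
{
    assert(n % 4 == 0);

    __m128i idx = _mm_set_epi32(3, 2, 1, 0);    /* lanes hold i, i+1, i+2, i+3 */
    const __m128i step = _mm_set1_epi32(4);     /* each lane advances by 4 */

    for (int i = 0; i < n; i += 4) {
        __m128 f = _mm_cvtepi32_ps(idx);        /* corresponds to cvtdq2ps */
        f = _mm_mul_ps(f, f);                   /* corresponds to mulps */
        _mm_storeu_ps(a + i, f);                /* unaligned store, safe for any float* */
        idx = _mm_add_epi32(idx, step);         /* corresponds to paddd */
    }
}
```

Calling `fill_squares_sse2(a, 100)` on a `float a[100]` should give the same values as the scalar loop in the question. Whether the compiler then needs a register-to-register copy depends on how it allocates `idx` and `f`, so the generated assembly may differ between compiler versions.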

Although compilers are known to sometimes emit unnecessary instructions, I don't really see that this is one of those cases.

If you want xmm0 (or xmm1) to remain an integer, then don't cast the first i to float: the expression (float) i*i is parsed as ((float) i) * i, so the multiplication is done in floating point. Perhaps what you wanted to do is:

  for (i=0;i<100;i++) a[i]= (float)(i*i); 
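As a quick check that the two spellings differ only in which multiplication is performed (float versus int) and not in the values stored, here is a small sketch. The helper name `count_cast_mismatches` is hypothetical, and the upper bound in the assertion is an assumption chosen so that `i * i` cannot overflow a 32-bit `int`.

```c
#include <assert.h>

/* Hypothetical helper: counts values of i in [0, limit) for which the two
   spellings of the expression produce different floats. */
static int count_cast_mismatches(int limit)
{
    /* Assumes 32-bit int: i * i stays in range only for i <= 46340. */
    assert(limit >= 0 && limit <= 46341);

    int mismatches = 0;
    for (int i = 0; i < limit; i++) {
        float cast_first = (float) i * i;    /* ((float)i) * i: float multiply */
        float mul_first  = (float)(i * i);   /* int multiply, then one conversion */
        if (cast_first != mul_first)
            mismatches++;
    }
    return mismatches;
}
```

For the range used in the question, `count_cast_mismatches(100)` returns 0, because every square below 10000 is exactly representable as a float. So the choice between the two forms only changes the instructions the compiler may pick, not the contents of `a`.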

On the other hand, gcc 4.9.2 does not seem to emit the extra move for the original code:

    g++ -S -O3 floop.cpp

    .L2:
        cvtdq2ps %xmm1, %xmm0
        mulps    %xmm0, %xmm0
        addq     $16, %rax
        paddd    %xmm2, %xmm1
        movaps   %xmm0, -16(%rax)
        cmpq     %rbp, %rax
        jne      .L2

Nor does clang (3.7.0, as of about 3 weeks ago):

    clang++ -S -O3 floop.cpp

        movdqa   .LCPI0_0(%rip), %xmm0      # xmm0 = [0,1,2,3]
        xorl     %eax, %eax
        .align   16, 0x90
    .LBB0_1:                                # %vector.body
                                            # =>This Inner Loop Header: Depth=1
        movd     %eax, %xmm1
        pshufd   $0, %xmm1, %xmm1           # xmm1 = xmm1[0,0,0,0]
        paddd    %xmm0, %xmm1
        cvtdq2ps %xmm1, %xmm1
        mulps    %xmm1, %xmm1
        movaps   %xmm1, (%rsp,%rax,4)
        addq     $4, %rax
        cmpq     $100, %rax
        jne      .LBB0_1

The code I compiled:

    extern int printf(const char *, ...);

    int main()
    {
        int i;
        float a[100];
        for (i = 0; i < 100; i++)
            a[i] = (float) i*i;
        for (i = 0; i < 100; i++)
            printf("%f\n", a[i]);
    }

(I added the printf to keep the compiler from getting rid of ALL the code.)
