Why do some C compilers set function return value in strange places?

I wrote this snippet in a recent argument over the expected speed of array[i++] versus array[i]; i++ .

    int array[10];

    int main()
    {
        int i = 0;
        while (i < 10) {
            array[i] = 0;
            i++;
        }
        return 0;
    }

Snippet in compiler explorer: https://godbolt.org/g/de7TY2
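For comparison, the array[i++] variant of the same loop would look something like this (a minimal sketch; the function name is just for illustration and is not in the godbolt link):

    // The array[i++] form of the loop; the helper name is illustrative only.
    int array[10];

    void fill_postincrement(void)
    {
        int i = 0;
        while (i < 10)
            array[i++] = 0;   /* store, then increment, in one expression */
    }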

As expected, the compiler outputs identical asm for array[i++] and array[i]; i++ at -O1 and above. However, I was surprised that the placement of xor eax, eax within the function seems almost random at higher optimization levels.


GCC

At -O2, GCC puts the xor just before ret, as expected:

    mov     DWORD PTR [rax], 0
    add     rax, 4
    cmp     rax, OFFSET FLAT:array+40
    jne     .L2
    xor     eax, eax
    ret

However, at -O3 it puts the xor after the second mov:

    mov     QWORD PTR array[rip], 0
    mov     QWORD PTR array[rip+8], 0
    xor     eax, eax
    mov     QWORD PTR array[rip+16], 0
    mov     QWORD PTR array[rip+24], 0
    mov     QWORD PTR array[rip+32], 0
    ret

ICC

ICC puts it in the usual place at -O1:

    push      rsi
    xor       esi, esi
    push      3
    pop       rdi
    call      __intel_new_feature_proc_init
    stmxcsr   DWORD PTR [rsp]
    xor       eax, eax
    or        DWORD PTR [rsp], 32832
    ldmxcsr   DWORD PTR [rsp]
    ..B1.2:
    mov       DWORD PTR [array+rax*4], 0
    inc       rax
    cmp       rax, 10
    jl        ..B1.2
    xor       eax, eax
    pop       rcx
    ret

but in a strange place at -O2

    push      rbp
    mov       rbp, rsp
    and       rsp, -128
    sub       rsp, 128
    xor       esi, esi
    mov       edi, 3
    call      __intel_new_feature_proc_init
    stmxcsr   DWORD PTR [rsp]
    pxor      xmm0, xmm0
    xor       eax, eax
    or        DWORD PTR [rsp], 32832
    ldmxcsr   DWORD PTR [rsp]
    movdqu    XMMWORD PTR array[rip], xmm0
    movdqu    XMMWORD PTR 16+array[rip], xmm0
    mov       DWORD PTR 32+array[rip], eax
    mov       DWORD PTR 36+array[rip], eax
    mov       rsp, rbp
    pop       rbp
    ret

and at -O3:

    and       rsp, -128
    sub       rsp, 128
    mov       edi, 3
    call      __intel_new_proc_init
    stmxcsr   DWORD PTR [rsp]
    xor       eax, eax
    or        DWORD PTR [rsp], 32832
    ldmxcsr   DWORD PTR [rsp]
    mov       rsp, rbp
    pop       rbp
    ret

Clang

Only Clang puts the xor immediately before ret at all optimization levels:

    xorps   xmm0, xmm0
    movaps  xmmword ptr [rip + array+16], xmm0
    movaps  xmmword ptr [rip + array], xmm0
    mov     qword ptr [rip + array+32], 0
    xor     eax, eax
    ret

Since GCC and ICC do this at higher levels of optimization, I believe there must be some good reason.

Why do some compilers do this?

Of course, the code is semantically identical either way, and the compiler is free to rearrange it as it wishes, but since the placement only changes at higher optimization levels, it must be caused by some optimization.

+8
Tags: optimization, c, assembly, gcc, compilation
4 answers

Different instructions have different latencies, and reordering them can sometimes speed up code for several reasons. For example, if an instruction takes several cycles to complete and sits at the end of the function, the program simply waits for it to finish; if it comes earlier in the function, other work can proceed while it completes. That is hardly the actual reason here, though: xor of a register with itself is a low-latency instruction, although latency does depend on the processor.

However, placing the xor there may have to do with breaking up the sequence of mov instructions it sits between.

There are also optimizations that take advantage of features of modern processors, such as pipelining and branch prediction (not the case here, as far as I can see). Understanding why the optimizer makes a particular choice requires a pretty deep knowledge of those features and of what the optimizer can do to exploit them.

You may find this informative. It pointed me to Agner Fog's website, a resource I had not seen before, which contains a lot of the information you ever wanted (or did not want :-)) to know but were afraid to ask :-)
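As a rough illustration of the latency point above, here is a minimal sketch (the function and variable names are made up, not taken from the question's code):

    int flag;

    int scaled(int a, int b)
    {
        int q = a / b;   /* integer division: a multi-cycle, long-latency instruction */
        flag = 1;        /* independent of the division, so an out-of-order CPU can
                            execute this store while the division is still in flight */
        return q;        /* only here does anything have to wait for the quotient */
    }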

+5

Since eax is not used before that point, the compiler can zero it whenever it wants, and everything works as expected.

Interestingly, you did not notice this in the ICC -O2 version:

    xor       eax, eax
    or        DWORD PTR [rsp], 32832
    ldmxcsr   DWORD PTR [rsp]
    movdqu    XMMWORD PTR array[rip], xmm0
    movdqu    XMMWORD PTR 16+array[rip], xmm0
    mov       DWORD PTR 32+array[rip], eax     ; set to 0 using the value of eax
    mov       DWORD PTR 36+array[rip], eax

Note that eax is zeroed for the return value, but is also used to zero two memory locations (the last two instructions), probably because an instruction using eax is shorter than an instruction with an immediate zero operand.

So, two birds with one stone.
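A rough C analogue of what ICC's output is doing (just an illustration of the register reuse: the local variable below stands in for eax, and the real transformation happens at the assembly level, not in source like this):

    int array[10];

    int main(void)
    {
        int zero = 0;          /* xor eax, eax */
        /* ... the first 32 bytes are cleared with two 16-byte SSE stores ... */
        array[8] = zero;       /* mov DWORD PTR 32+array[rip], eax  (shorter than a store of an imm32) */
        array[9] = zero;       /* mov DWORD PTR 36+array[rip], eax */
        return zero;           /* eax already holds the return value 0 */
    }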

+6

Those memory accesses are expected to take at least a few clock cycles each. Moving the xor around does not change the functionality of the code. By pulling it back so that one or more memory accesses come after it, it becomes free: it takes no extra execution time because it runs in parallel with the memory access (the processor finishes the xor while it waits for the memory operation, rather than just sitting idle). Stick it into a clump of instructions with no memory access, and it costs at least a clock. And, as you probably know, using xor rather than a mov with an immediate zero reduces the size of the instruction; it may not save any clocks, but it does save space in the binary. Gee whiz, a cool optimization that goes back to the original 8086 and is still used today, even when it doesn't ultimately save you anything.
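A minimal illustration of the size point, using standard x86-64 encodings (this sketch is an addition, not part of the original question):

    /* For a function that just returns 0, optimizing compilers typically emit
     *     xor eax, eax      ; 2 bytes
     *     ret
     * rather than
     *     mov eax, 0        ; 5 bytes
     *     ret
     * Both zero eax; the xor form is simply the shorter encoding. */
    int zero(void)
    {
        return 0;
    }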

+3

When the compiler sets that specific value depends on the point, as it walks the execution tree, at which it is sure that the register will no longer be needed and will not be changed by the outside world.

Here's a non-trivial example: https://godbolt.org/g/6AowMJ

And the compiler zeroes eax after the memset, because the call to memset can change its value. The exact moment depends on a complex analysis of that tree, and it can look quite illogical to a human.
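A minimal sketch of that kind of situation (the shape is assumed; it is not necessarily the exact code behind the godbolt link):

    #include <string.h>

    int array[10];

    int main(void)
    {
        /* The call clobbers eax (memset returns its first argument in rax),
           so the compiler can only zero eax for main's return value
           after the call has come back. */
        memset(array, 0, sizeof array);
        return 0;
    }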

-1
