Why do LLVM and GCC use different function prologues on x86-64?

The trivial function that I am compiling with gcc and clang:

void test() { printf("hm"); printf("hum"); } 


    $ gcc test.c -fomit-frame-pointer -masm=intel -O3 -S

        sub     rsp, 8
        .cfi_def_cfa_offset 16
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, 1
        xor     eax, eax
        call    __printf_chk
        mov     esi, OFFSET FLAT:.LC1
        mov     edi, 1
        xor     eax, eax
        add     rsp, 8
        .cfi_def_cfa_offset 8
        jmp     __printf_chk

and

    $ clang test.c -mllvm --x86-asm-syntax=intel -fomit-frame-pointer -O3 -S

    # BB#0:
        push    rax
    .Ltmp1:
        .cfi_def_cfa_offset 16
        mov     edi, .L.str
        xor     eax, eax
        call    printf
        mov     edi, .L.str1
        xor     eax, eax
        pop     rdx
        jmp     printf                  # TAILCALL

The difference I'm interested in is that gcc uses sub rsp, 8 / add rsp, 8 for the function prologue and epilogue, while clang uses push rax / pop rdx.

Why do the compilers use different function prologues? Which variant is better? push and pop certainly encode to shorter instructions, but are they faster or slower than add and sub?

The reason for the stack adjustment in the first place is that the ABI requires rsp to be 16-byte aligned for non-leaf procedures. I could not find any compiler flags that remove these instructions.

Judging by your answers, it seems that push and pop are better. push rax + pop rdx = 1 + 1 = 2 bytes vs. sub rsp, 8 + add rsp, 8 = 4 + 4 = 8 bytes. Thus, the first pair saves 6 bytes at no cost.

2 answers

On Intel CPUs, the sub / add will trigger the stack engine to insert an extra uop to synchronize %rsp for the out-of-order execution part of the pipeline. (See Agner Fog's microarch doc, in particular pg 91, about the stack engine. AFAIK, it still works the same way on Haswell as it did on Pentium M, as far as when it needs to insert extra uops.)

The push / pop will take fewer fused-domain uops, and so will probably be more efficient even though they use the store / load ports. They come between call / ret pairs.

So push / pop will at least not be slower, and it takes fewer instruction bytes. Better I-cache density is good.

By the way, I think the point of the pair of insns is to keep the stack 16B-aligned after the call pushes the 8B return address. This is one case where the ABI ends up requiring semi-useless instructions. More complex functions that need some stack space to spill locals and then reload them after function calls will usually have a sub $something, %rsp to reserve space.

The SystemV (Linux) amd64 ABI guarantees that on function entry, (%rsp + 8), which is where args on the stack will be if there are any, is 16B-aligned. ( http://x86-64.org/documentation/abi.pdf ). You have to arrange for this to be the case for any function you call, or it is your fault if they segfault from using an SSE aligned load. Or they may otherwise crash from making assumptions about how they can use AND to mask an address or something.


According to the experiments I did on my machine, push/pop have the same speed as add/sub. I guess this should be the case for all modern computers.

In any case, the difference (if any) is really microscopic, so I suggest you can safely assume that they are equivalent.

