Why do LLVM and GCC use different function prologues on x86-64?

Question


Here is the trivial function that I am compiling with gcc and clang:

void test() { printf("hm"); printf("hum"); }

 $ gcc test.c -fomit-frame-pointer -masm=intel -O3 -S

    sub     rsp, 8
    .cfi_def_cfa_offset 16
    mov     esi, OFFSET FLAT:.LC0
    mov     edi, 1
    xor     eax, eax
    call    __printf_chk
    mov     esi, OFFSET FLAT:.LC1
    mov     edi, 1
    xor     eax, eax
    add     rsp, 8
    .cfi_def_cfa_offset 8
    jmp     __printf_chk

and

 $ clang test.c -mllvm --x86-asm-syntax=intel -fomit-frame-pointer -O3 -S

    # BB#0:
    push    rax
    .Ltmp1:
    .cfi_def_cfa_offset 16
    mov     edi, .L.str
    xor     eax, eax
    call    printf
    mov     edi, .L.str1
    xor     eax, eax
    pop     rdx
    jmp     printf    # TAILCALL

The difference I'm interested in is that gcc uses sub rsp, 8 / add rsp, 8 for the function prologue and epilogue, while clang uses push rax / pop rdx.

Why do the compilers use different function prologues? Which variant is better? push and pop certainly encode to shorter instructions, but are they faster or slower than add and sub?

The reason for the stack adjustment in the first place is that the ABI requires rsp to be 16-byte aligned in non-leaf procedures. I could not find any compiler flags that remove it.

Judging by the answers, it seems that push and pop are better. push rax + pop rdx = 1 + 1 = 2 bytes vs. sub rsp, 8 + add rsp, 8 = 4 + 4 = 8 bytes. So the first pair saves 6 bytes at no cost.
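The byte counts above can be checked against the actual encodings. The following is a small sketch in C, not compiler output: the array names, the enum constants, and the helper print_prologue_sizes are made up for illustration, while the opcode bytes are the standard x86-64 encodings of the four instructions.

```c
#include <stddef.h>
#include <stdio.h>

/* Standard x86-64 machine-code bytes for the four instructions. */
const unsigned char push_rax[]  = { 0x50 };                   /* push rax   */
const unsigned char pop_rdx[]   = { 0x5A };                   /* pop  rdx   */
const unsigned char sub_rsp_8[] = { 0x48, 0x83, 0xEC, 0x08 }; /* sub rsp, 8 */
const unsigned char add_rsp_8[] = { 0x48, 0x83, 0xC4, 0x08 }; /* add rsp, 8 */

/* Total code size of each prologue/epilogue pair, in bytes. */
enum {
    PUSH_POP_BYTES = sizeof push_rax + sizeof pop_rdx,
    SUB_ADD_BYTES  = sizeof sub_rsp_8 + sizeof add_rsp_8,
    BYTES_SAVED    = SUB_ADD_BYTES - PUSH_POP_BYTES
};

/* Hypothetical helper: print one instruction's bytes in hex. */
static void dump(const char *name, const unsigned char *code, size_t len)
{
    printf("%-10s", name);
    for (size_t i = 0; i < len; i++)
        printf(" %02X", code[i]);
    printf("  (%zu bytes)\n", len);
}

/* Hypothetical helper: print all four encodings and the size difference. */
void print_prologue_sizes(void)
{
    dump("push rax", push_rax, sizeof push_rax);
    dump("pop rdx", pop_rdx, sizeof pop_rdx);
    dump("sub rsp,8", sub_rsp_8, sizeof sub_rsp_8);
    dump("add rsp,8", add_rsp_8, sizeof add_rsp_8);
    printf("push/pop pair: %d bytes, sub/add pair: %d bytes, saved: %d\n",
           PUSH_POP_BYTES, SUB_ADD_BYTES, BYTES_SAVED);
}
```

Calling print_prologue_sizes() from a main function prints the four encodings. The sub/add forms are longer because each needs a REX.W prefix, an opcode, a ModRM byte and an 8-bit immediate, whereas push and pop of a 64-bit register fit in a single opcode byte.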

+5

c assembly gcc x86-64 llvm

Björn Lindqvist Jul 21 '15 at 11:12

2 answers

According to the experiments I did on my machine, push/pop run at the same speed as add/sub. I guess this should be the case on all modern computers.

In any case, the difference (if any) is really microscopic, so I suggest you can safely assume that they are equivalent.

+1

Whatsup Jul 21 '15 at 11:29


Peter Cordes · Accepted Answer · 2015-07-21T11:29:45+0000

On Intel CPUs, sub / add will trigger the stack engine to insert an extra uop to synchronize %rsp for the out-of-order part of the pipeline. (See Agner Fog's microarch doc, in particular pg 91, about the stack engine. AFAIK, it still works the same way on Haswell as on Pentium M as far as when it needs to insert extra uops.)

push / pop will take fewer fused-domain uops, and so will probably be more efficient even though they use the store / load ports. They come between call / ret pairs.

So push / pop will at least not be slower, and they take fewer instruction bytes. Better I-cache density is good.

BTW, I think the point of the pair of insns is to keep the stack 16B-aligned after the call pushes the 8B return address. This is one case where the ABI ends up requiring semi-useless instructions. More complex functions that need some stack space to spill locals, and then reload them after function calls, will usually use sub $something, %rsp to reserve space.

The SystemV (Linux) amd64 ABI guarantees that on function entry, (%rsp + 8), which is where any stack args will be, is 16B-aligned. ( http://x86-64.org/documentation/abi.pdf ). You have to arrange this for any function that you call, or it is your fault if it segfaults from using an SSE aligned load, or crashes because it made assumptions about being able to use AND to mask an address or something.
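To make the alignment arithmetic concrete, here is a rough model in C. It is only a sketch: it treats %rsp as a plain integer, and the names rsp_at_nested_call, is_aligned, SLOT and ABI_ALIGN are made up for illustration. It assumes the caller had %rsp 16B-aligned just before its call instruction, as the ABI requires.

```c
#include <assert.h>
#include <stdint.h>

enum {
    SLOT      = 8,  /* size of a return address or one pushed register */
    ABI_ALIGN = 16  /* alignment required just before a call           */
};

/* Nonzero if the modeled stack pointer is 16-byte aligned. */
int is_aligned(uint64_t rsp)
{
    return rsp % ABI_ALIGN == 0;
}

/*
 * Model of %rsp inside a function at the point where it makes its own call.
 * rsp_before_call: the caller's %rsp just before the call instruction
 *                  that entered this function (must be 16B-aligned).
 * slots:           number of 8-byte adjustments made in the prologue,
 *                  e.g. 1 for a single push rax or sub rsp, 8.
 */
uint64_t rsp_at_nested_call(uint64_t rsp_before_call, unsigned slots)
{
    assert(is_aligned(rsp_before_call));
    uint64_t rsp = rsp_before_call - SLOT;  /* call pushed the return address */
    rsp -= (uint64_t)slots * SLOT;          /* prologue adjustment            */
    return rsp;
}
```

With an example starting value of 0x7fffffffe000, zero slots leaves the modeled %rsp at 0x7fffffffdff8, which is off by 8, while one slot gives 0x7fffffffdff0, which is aligned again. That single 8-byte slot is all that either push rax or sub rsp, 8 is there to provide.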
