How much does function alignment matter on modern processors?

When I compile C code with a recent compiler on an amd64 or x86 system, functions are aligned to a multiple of 16 bytes. How much does this alignment actually matter on modern processors? Is there a large performance penalty for calling an unaligned function?

Benchmark

I ran the following microbenchmark (call.S):

    // benchmarking performance penalty of function alignment.
    #include <sys/syscall.h>

    #ifndef SKIP
    # error "SKIP undefined"
    #endif

    #define COUNT 1073741824

            .globl _start
            .type _start,@function
    _start: mov $COUNT,%rcx
    0:      call test
            dec %rcx
            jnz 0b
            mov $SYS_exit,%rax
            xor %edi,%edi
            syscall
            .size _start,.-_start

            .align 16
            .space SKIP
    test:   nop
            rep ret
            .size test,.-test

with the following shell script:

    #!/bin/sh
    for i in `seq 0 15` ; do
        echo SKIP=$i
        cc -c -DSKIP=$i call.S
        ld -o call call.o
        time -p ./call
    done

On a processor that identifies itself as Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz according to /proc/cpuinfo, the offset made no difference: the benchmark took 1.9 seconds to run.

On the other hand, on another system whose processor reports itself as an Intel(R) Core i7 at 6.13 GHz, the benchmark takes 6.3 seconds, except with an offset of 14 or 15, where it takes 7.2 seconds. I think this is because the function then starts to span multiple cache lines.
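Offsets 14 and 15 are exactly the ones at which the three-byte test function straddles a 16-byte boundary. The arithmetic can be sketched in C as below. The names `crosses_boundary` and `TEST_SIZE` are hypothetical, and the three-byte size assumes the assembler encodes nop as one byte and rep ret as two. Whether that 16-byte boundary is also a 64-byte cache-line boundary depends on where the linker places test, which .align 16 alone does not determine.

```c
#include <stddef.h>

/* Assumed size of `test` in bytes: nop (0x90) plus rep ret (0xF3 0xC3). */
enum { TEST_SIZE = 3 };

/*
 * Returns 1 if a block of `size` bytes (size > 0) that starts `offset`
 * bytes past a `boundary`-aligned address crosses a `boundary`-byte
 * boundary, and 0 otherwise.
 */
static int crosses_boundary(size_t offset, size_t size, size_t boundary)
{
    size_t first_block = offset / boundary;
    size_t last_block = (offset + size - 1) / boundary;
    return first_block != last_block;
}
```

With a boundary of 16, `crosses_boundary(i, TEST_SIZE, 16)` is 0 for every SKIP value from 0 to 13 and 1 for 14 and 15, which matches the two slow cases.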

performance assembly x86-64 alignment
1 answer

TL;DR: Cache alignment matters. You don't want to fetch bytes that you will never execute.

At the very least, you want to avoid fetching instructions that precede the first one you will execute. Since this is a microbenchmark, you most likely won't see any difference. But imagine a full program in which a bunch of functions each incur an extra cache miss because their first byte was not aligned to a cache line, so that you end up fetching a new cache line for the last N bytes of the function (where N <= the number of bytes before the function that you cached but did not use).
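To make the N-bytes argument concrete, here is a small sketch in C that counts how many cache lines a function occupies and how many bytes in its first line precede its entry point. The helper names and the 64-byte line size are assumptions for illustration, not something taken from the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes; typical for x86, but not guaranteed. */
#define CACHE_LINE 64u

/* Bytes in the function's first cache line that precede its entry point. */
static size_t wasted_prefix(uintptr_t start)
{
    return (size_t)(start % CACHE_LINE);
}

/* Number of cache lines covered by the range [start, start + size), size > 0. */
static size_t lines_touched(uintptr_t start, size_t size)
{
    uintptr_t first_line = start / CACHE_LINE;
    uintptr_t last_line = (start + size - 1) / CACHE_LINE;
    return (size_t)(last_line - first_line + 1);
}
```

For example, a 64-byte function at address 0x1000 fits in one line, while the same function at 0x1030 has 48 unused bytes in front of it and spills into a second line.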

Intel's optimization manual says the following:

3.4.1.5 Code alignment

Careful arrangement of code can enhance cache and memory locality. Likely sequences of basic blocks should be laid out contiguously in memory. This may involve removing unlikely code, such as code to handle error conditions, from the sequence. See Section 3.7, “Prefetching,” on optimizing the instruction prefetcher.

Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.

Assembly/Compiler Coding Rule 13. (M impact, H generality) If the body of a conditional is not likely to be executed, it should be placed in another part of the program. If it is highly unlikely to be executed and code locality is an issue, it should be placed on a different code page.
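In C, these two rules can be approximated with GCC/Clang extensions. The following is only a sketch: `process` and `report_error` are hypothetical functions, and the attributes are requests to the compiler, which decides the final code layout.

```c
#include <stdio.h>

/* Rarely executed error path. `cold` hints that it should be kept away
 * from hot code; `noinline` keeps it out of the caller's body. */
__attribute__((cold, noinline))
static int report_error(int code)
{
    fprintf(stderr, "error %d\n", code);
    return -1;
}

/* Request at least 16-byte alignment for the function's entry point. */
__attribute__((aligned(16)))
int process(int value)
{
    /* __builtin_expect marks the error branch as unlikely. */
    if (__builtin_expect(value < 0, 0))
        return report_error(value);
    return value * 2;
}
```

Compilers usually apply a default function alignment when optimizing, so the explicit attribute matters mostly when that default is disabled or a larger alignment is wanted.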

This also helps explain why you did not notice any difference in your program: all of the code is cached once and never leaves the cache (apart from context switches, of course).
