Why does 32-byte loop alignment make code faster?

Take a look at this code:

one.cpp:

    bool test(int a, int b, int c, int d);

    int main() {
        volatile int va = 1;
        volatile int vb = 2;
        volatile int vc = 3;
        volatile int vd = 4;
        int a = va;
        int b = vb;
        int c = vc;
        int d = vd;

        int s = 0;

        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
        __asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");

        for (int i = 0; i < 2000000000; i++) {
            s += test(a, b, c, d);
        }
        return s;
    }

two.cpp:

    bool test(int a, int b, int c, int d) {
        // return a == d || b == d || c == d;
        return false;
    }

There are 16 nops in one.cpp. You can comment/uncomment them to switch the alignment of the loop entry point between 16 and 32 bytes. I compiled with g++ one.cpp two.cpp -O3 -mtune=native .

Here are my questions:

  • The 32-byte aligned version is faster than the 16-byte aligned version. On Sandy Bridge the difference is about 20%; on Haswell, about 8%. Why?
  • With the 32-byte aligned version, the code runs at the same speed on Sandy Bridge regardless of which return statement is used in two.cpp. I expected the return false version to be at least a little faster, but no: exactly the same speed!
  • If I remove volatile from one.cpp, the code becomes slower (Haswell: before ~2.17 s, after ~2.38 s). Why? And this only happens when the loop is aligned to 32 bytes.

The fact that the 32-byte aligned version is faster is strange to me, because the Intel® 64 and IA-32 Architectures Optimization Reference Manual says (p. 3-9):

Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.

Another small question: are there any tricks to make only this loop 32-byte aligned (so the rest of the code can keep using 16-byte alignment)?
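For concreteness, the kind of trick I mean (just a sketch, dropping it into main from one.cpp above, and assuming GNU-style inline assembly; I haven't checked how the compiler places the padding relative to the loop top):

    // Sketch: ask the assembler to pad with NOPs up to the next 32-byte
    // boundary right before the loop, instead of hand-writing __asm__("nop")
    // chains. .p2align 5 means "align to 2^5 = 32 bytes" in GNU as syntax.
    // Whether the loop's entry point lands exactly on the boundary still
    // depends on what the compiler emits between this statement and the loop.
    __asm__(".p2align 5");
    for (int i = 0; i < 2000000000; i++) {
        s += test(a, b, c, d);
    }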

Note: I tried gcc 6, gcc 7 and clang 3.9; the results are the same.




Here is the code with volatile (the code is the same for 16/32 alignment, only the addresses differ):

    0000000000000560 <main>:
     560:   41 57                   push   r15
     562:   41 56                   push   r14
     564:   41 55                   push   r13
     566:   41 54                   push   r12
     568:   55                      push   rbp
     569:   31 ed                   xor    ebp,ebp
     56b:   53                      push   rbx
     56c:   bb 00 94 35 77          mov    ebx,0x77359400
     571:   48 83 ec 18             sub    rsp,0x18
     575:   c7 04 24 01 00 00 00    mov    DWORD PTR [rsp],0x1
     57c:   c7 44 24 04 02 00 00    mov    DWORD PTR [rsp+0x4],0x2
     583:   00
     584:   c7 44 24 08 03 00 00    mov    DWORD PTR [rsp+0x8],0x3
     58b:   00
     58c:   c7 44 24 0c 04 00 00    mov    DWORD PTR [rsp+0xc],0x4
     593:   00
     594:   44 8b 3c 24             mov    r15d,DWORD PTR [rsp]
     598:   44 8b 74 24 04          mov    r14d,DWORD PTR [rsp+0x4]
     59d:   44 8b 6c 24 08          mov    r13d,DWORD PTR [rsp+0x8]
     5a2:   44 8b 64 24 0c          mov    r12d,DWORD PTR [rsp+0xc]
     5a7:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
     5ac:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
     5b3:   00 00 00
     5b6:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
     5bd:   00 00 00
     5c0:   44 89 e1                mov    ecx,r12d
     5c3:   44 89 ea                mov    edx,r13d
     5c6:   44 89 f6                mov    esi,r14d
     5c9:   44 89 ff                mov    edi,r15d
     5cc:   e8 4f 01 00 00          call   720 <test(int, int, int, int)>
     5d1:   0f b6 c0                movzx  eax,al
     5d4:   01 c5                   add    ebp,eax
     5d6:   83 eb 01                sub    ebx,0x1
     5d9:   75 e5                   jne    5c0 <main+0x60>
     5db:   48 83 c4 18             add    rsp,0x18
     5df:   89 e8                   mov    eax,ebp
     5e1:   5b                      pop    rbx
     5e2:   5d                      pop    rbp
     5e3:   41 5c                   pop    r12
     5e5:   41 5d                   pop    r13
     5e7:   41 5e                   pop    r14
     5e9:   41 5f                   pop    r15
     5eb:   c3                      ret
     5ec:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

Without volatile:

    0000000000000560 <main>:
     560:   55                      push   rbp
     561:   31 ed                   xor    ebp,ebp
     563:   53                      push   rbx
     564:   bb 00 94 35 77          mov    ebx,0x77359400
     569:   48 83 ec 08             sub    rsp,0x8
     56d:   66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
     574:   00 00
     576:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
     57d:   00 00 00
     580:   b9 04 00 00 00          mov    ecx,0x4
     585:   ba 03 00 00 00          mov    edx,0x3
     58a:   be 02 00 00 00          mov    esi,0x2
     58f:   bf 01 00 00 00          mov    edi,0x1
     594:   e8 47 01 00 00          call   6e0 <test(int, int, int, int)>
     599:   0f b6 c0                movzx  eax,al
     59c:   01 c5                   add    ebp,eax
     59e:   83 eb 01                sub    ebx,0x1
     5a1:   75 dd                   jne    580 <main+0x20>
     5a3:   48 83 c4 08             add    rsp,0x8
     5a7:   89 e8                   mov    eax,ebp
     5a9:   5b                      pop    rbx
     5aa:   5d                      pop    rbp
     5ab:   c3                      ret
     5ac:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]
performance gcc benchmarking x86-64 clang
Jul 25 '17 at 9:14
1 answer

This doesn't address point 2 (return a == d || b == d || c == d being the same speed as return false). That is still a potentially interesting question, since that version must compile to multiple instructions, possibly spanning more than one uop-cache line.




The fact that the 32-byte aligned version is faster is strange to me, because [the Intel manual says to align to 16]

That optimization guideline is very general, and it certainly doesn't mean that larger alignment never helps. Usually it doesn't, and padding to 32 bytes would be more likely to hurt than help (I-cache misses, ITLB misses, and more code bytes to load from disk).

In fact, 16B alignment is rarely necessary at all, especially on CPUs with a uop cache. For a small loop that can run from the loop buffer, alignment usually doesn't matter.




16B is still not bad as a broad recommendation, but it doesn't tell you everything you need to know to understand one specific case on a couple of specific CPUs.

Compilers usually align loop branches and function entry points by default, but usually don't align other branch targets. The cost of executing NOPs (and the code bloat) is often larger than the likely cost of an unaligned non-loop branch target.
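For reference (a sketch, not something from your question): with GCC these defaults can be adjusted globally with -falign-functions=N / -falign-loops=N, or per function with the aligned attribute. Hypothetically, applied to the test function from two.cpp:

    // Request a minimum 32-byte alignment for the first instruction of this
    // function (GCC/Clang function attribute). This only affects the function
    // entry point; the alignment of loop tops inside a function is governed
    // by -falign-loops rather than by an attribute.
    __attribute__((aligned(32)))
    bool test(int a, int b, int c, int d) {
        return a == d || b == d || c == d;
    }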




Code alignment has some direct and some indirect effects. The direct effects include the uop cache on the Intel SnB family. For example, see Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs.

Another section of the Intel optimization manual goes into some detail about how the uop cache works:

2.3.2.2 Decoded ICache:

  • All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region. (I think this means that an instruction which extends past the boundary goes in the uop cache for the block containing its start, rather than its end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block.)
  • An instruction that decodes to multiple micro-ops cannot be split across Ways.
  • An instruction which turns on the MSROM consumes an entire Way.
  • Up to two branches are allowed per Way.
  • A pair of macro-fused instructions is kept as one micro-op.

See also Agner Fog's microarchitecture guide. He adds:

  • An unconditional jump or call always ends a μop cache line
  • lots of other things that are probably not relevant here.

Also note that if your code doesn't fit in the uop cache, it can't run from the loop buffer.




Indirect alignment effects include:

  • larger/smaller code size (L1I cache misses, TLB misses). Not relevant to your test.
  • which branches alias each other in the BTB (Branch Target Buffer).

If I remove volatile from one.cpp, the code becomes slower. Why is this?

The larger instructions push the last instruction in the loop across a 32B boundary:

     59e:   83 eb 01                sub    ebx,0x1
     5a1:   75 dd                   jne    580 <main+0x20>

So if you're not running from the loop buffer (LSD), then without volatile one of the uop-cache fetch cycles gets only 1 uop.
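Counting bytes from the two listings above (as a rough check on that):

    with volatile:    4x3 (mov r,r) + 5 (call) + 3 (movzx) + 2 (add) + 3 (sub) + 2 (jne) = 27 bytes
                      loop occupies 0x5c0..0x5da -> entirely inside the 32B region 0x5c0..0x5df
    without volatile: 4x5 (mov r,imm32) + 5 + 3 + 2 + 3 + 2 = 35 bytes
                      loop occupies 0x580..0x5a2 -> crosses the 32B boundary at 0x5a0, and the
                      jne ends up alone in the next 32B region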

If sub/jne macro-fuse, this might not apply. And I think only crossing a 64B boundary would break macro-fusion.

Also, those aren't real addresses. Did you check the addresses after linking? There could be a 64B boundary there after linking, if the text section has less than 64B alignment.




Sorry, I haven't actually tested this to say more about this specific case. The point is that when you bottleneck on the front end with something like a call/ret inside a tight loop, alignment becomes important and can get extremely complex. Whether a boundary is crossed or not affects all subsequent instructions. Do not expect it to be simple. If you've read my other answers, you'll know I'm not usually the kind of person to say "it's too complicated to fully explain", but alignment can be that way.

See also Code alignment in one object file is affecting the performance of a function in another object file.

In your case, make sure tiny functions inline. Use link-time optimization if your codebase has any important tiny functions in separate .cpp files instead of in a .h where they can inline. Or change your code to put them in a .h.
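For example, a sketch of the header approach, using your test function (the -flto spelling below is GCC's):

    // two.h -- defining the tiny function as inline in a header lets the
    // compiler inline it into the caller's loop, removing the call/ret
    // (and most of this alignment sensitivity) from the hot loop entirely.
    inline bool test(int a, int b, int c, int d) {
        return a == d || b == d || c == d;
    }

Or keep the definition in two.cpp and compile with g++ -O3 -flto one.cpp two.cpp so cross-file inlining can still happen.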

Jul 29 '17 at 16:55


