How does loop address alignment affect speed on Intel x86_64?

I see a 15% performance drop for the same C++ code, compiled to exactly the same machine instructions but located at different addresses. When my tiny main loop starts at 0x415220, it is faster than when it starts at 0x415250. I am running this on an Intel Core 2 Duo, using gcc 4.4.5 on x86_64 Ubuntu.

Can someone explain the reason for the slowdown, and how I can get gcc to align the loop optimally?

Here is the disassembly for both cases, with profiler annotation:

  415220 576 12.56% | XXXXXXXXXXXXXXX 48 c1 eb 08 shr $0x8,%rbx
  415224 110 2.40% | XX 0f b6 c3 movzbl %bl,%eax
  415227 0.00% | 41 0f b6 04 00 movzbl (%r8,%rax,1),%eax
  41522c 40 0.87% | 48 8b 04 c1 mov (%rcx,%rax,8),%rax
  415230 806 17.58% | XXXXXXXXXXXXXXXXXXXX 4c 63 f8 movslq %eax,%r15
  415233 186 4.06% | XXXX 48 c1 e8 20 shr $0x20,%rax
  415237 102 2.22% | XX 4c 01 f9 add %r15,%rcx
  41523a 414 9.03% | XXXXXXXXXX a8 0f test $0xf,%al
  41523c 680 14.83% | XXXXXXXXXXXXXXXXX 74 45 je 415283 <…::Run(char const*, char const*)+0x4b3>
  41523e 0.00% | 41 89 c7 mov %eax,%r15d
  415241 0.00% | 41 83 e7 01 and $0x1,%r15d
  415245 0.00% | 41 83 ff 01 cmp $0x1,%r15d
  415249 0.00% | 41 89 c7 mov %eax,%r15d

  415250 679 13.05% | XXXXXXXXXXXXXXXXX 48 c1 eb 08 shr $0x8,%rbx
  415254 124 2.38% | XX 0f b6 c3 movzbl %bl,%eax
  415257 0.00% | 41 0f b6 04 00 movzbl (%r8,%rax,1),%eax
  41525c 43 0.83% | X 48 8b 04 c1 mov (%rcx,%rax,8),%rax
  415260 828 15.91% | XXXXXXXXXXXXXXXXXXXX 4c 63 f8 movslq %eax,%r15
  415263 388 7.46% | XXXXXXXXX 48 c1 e8 20 shr $0x20,%rax
  415267 141 2.71% | XXX 4c 01 f9 add %r15,%rcx
  41526a 634 12.18% | XXXXXXXXXXXXXXXX a8 0f test $0xf,%al
  41526c 749 14.39% | XXXXXXXXXXXXXXXXXX 74 45 je 4152b3 <…::Run(char const*, char const*)+0x4c3>
  41526e 0.00% | 41 89 c7 mov %eax,%r15d
  415271 0.00% | 41 83 e7 01 and $0x1,%r15d
  415275 0.00% | 41 83 ff 01 cmp $0x1,%r15d
  415279 0.00% | 41 89 c7 mov %eax,%r15d
+7
2 answers

gcc has the option -falign-loops=n, where n is the maximum number of bytes to skip; if n is omitted, a machine-dependent default is used. gcc enables this option automatically at the -O2 and -O3 levels.

+4

On Intel processors that have a Loop Stream Detector, aligning the loop body can improve efficiency, particularly at ordinary unrolling levels. Alignment incurs a penalty the first time the loop is entered from above. You didn't show the code where, in the aligned case, there would be somewhat pointless fancy no-op instructions. gcc normally uses conditional alignment, which is applied only when a limited amount of padding is needed. When I studied this once, the options that affect this behavior did not seem very effective. As Alexander said, it is important to set -march or -mtune so that gcc can use the appropriate alignment settings. All the compilers I use fail to align the loop body in some cases, and there is no control over this.

+2
