I see a 15% performance drop for the same C++ code compiled to exactly the same machine instructions but located at different addresses. When my tiny main loop starts at 0x415220, it is faster than when it starts at 0x415250. I am running this on an Intel Core 2 Duo, using gcc 4.4.5 on x86_64 Ubuntu.
Can someone explain the cause of the slowdown, and how I can get gcc to align the loop optimally?
Here is the disassembly for both cases, with profiler annotations:
415220  576 12.56% | XXXXXXXXXXXXXXX      48 c1 eb 08     shr    $0x8,%rbx
415224  110  2.40% | XX                   0f b6 c3        movzbl %bl,%eax
415227       0.00% |                      41 0f b6 04 00  movzbl (%r8,%rax,1),%eax
41522c   40  0.87% |                      48 8b 04 c1     mov    (%rcx,%rax,8),%rax
415230  806 17.58% | XXXXXXXXXXXXXXXXXXXX 4c 63 f8        movslq %eax,%r15
415233  186  4.06% | XXXX                 48 c1 e8 20     shr    $0x20,%rax
415237  102  2.22% | XX                   4c 01 f9        add    %r15,%rcx
41523a  414  9.03% | XXXXXXXXXX           a8 0f           test   $0xf,%al
41523c  680 14.83% | XXXXXXXXXXXXXXXXX    74 45           je     415283 <::Run(char const*, char const*)+0x4b3>
41523e       0.00% |                      41 89 c7        mov    %eax,%r15d
415241       0.00% |                      41 83 e7 01     and    $0x1,%r15d
415245       0.00% |                      41 83 ff 01     cmp    $0x1,%r15d
415249       0.00% |                      41 89 c7        mov    %eax,%r15d

415250  679 13.05% | XXXXXXXXXXXXXXXXX    48 c1 eb 08     shr    $0x8,%rbx
415254  124  2.38% | XX                   0f b6 c3        movzbl %bl,%eax
415257       0.00% |                      41 0f b6 04 00  movzbl (%r8,%rax,1),%eax
41525c   43  0.83% | X                    48 8b 04 c1     mov    (%rcx,%rax,8),%rax
415260  828 15.91% | XXXXXXXXXXXXXXXXXXXX 4c 63 f8        movslq %eax,%r15
415263  388  7.46% | XXXXXXXXX            48 c1 e8 20     shr    $0x20,%rax
415267  141  2.71% | XXX                  4c 01 f9        add    %r15,%rcx
41526a  634 12.18% | XXXXXXXXXXXXXXXX     a8 0f           test   $0xf,%al
41526c  749 14.39% | XXXXXXXXXXXXXXXXXX   74 45           je     4152b3 <::Run(char const*, char const*)+0x4c3>
41526e       0.00% |                      41 89 c7        mov    %eax,%r15d
415271       0.00% |                      41 83 e7 01     and    $0x1,%r15d
415275       0.00% |                      41 83 ff 01     cmp    $0x1,%r15d
415279       0.00% |                      41 89 c7        mov    %eax,%r15d