X86 64-bit mode indexed branch

This is a continuation of some comments made in this previous thread:

Fibonacci Recursive Assembly

The following code fragments compute Fibonacci, the first example with a loop, the second example with a computed jump (indexed branch) into an unrolled loop. This was tested using Visual Studio 2015 Desktop Express in 64-bit mode on Windows 7 Pro with an Intel 3770K 3.5 GHz processor. For a single test cycle of fib(0) through fib(93), the best time I get for the loop version is ~1.901 microseconds, and for the computed jump ~1.324 microseconds. Using an outer loop to repeat this process 1,048,576 times, the loop version takes about 1.44 seconds and the computed jump about 1.04 seconds. In both test runs, the loop version is approximately 40% slower than the computed-jump version.

Question: Why is the loop version more sensitive to code placement than the computed-jump version? In previous tests, some combinations of code placement caused the loop version's time to increase from about 1.44 seconds to 1.93 seconds, but I never found a combination that significantly affected the computed-jump version's time.

Partial answer: the computed jump branches to one of 94 possible target locations within 280 bytes, and apparently the branch target buffer (cache) does a good job of optimizing this. For the loop version, using align 16 to place the assembly-based fib() function on a 16-byte boundary solved the timing problem in most cases, but some changes to main() still affected the time. I need to find a reasonably small and repeatable test case.

Loop version (note: I've read that dec / jnz is faster than loop):

            align   16
    fib     proc                    ;rcx == n
            mov     rax,rcx         ;br if < 2
            cmp     rax,2
            jb      fib1
            mov     rdx,1           ;set rax, rdx
            and     rax,rdx
            sub     rdx,rax
            shr     rcx,1
    fib0:   add     rdx,rax
            add     rax,rdx
            dec     rcx
            jnz     fib0
    fib1:   ret
    fib     endp

Computed jump (indexed branch) into the unrolled loop version:

            align   16
    fib     proc                    ;rcx == n
            mov     r8,rcx          ;set jmp adr
            mov     r9,offset fib0+279
            lea     r8,[r8+r8*2]
            neg     r8
            add     r8,r9
            mov     rax,rcx         ;set rax,rdx
            mov     rdx,1
            and     rax,rdx
            sub     rdx,rax
            jmp     r8
    fib0:                           ; assumes add xxx,xxx takes 3 bytes
            rept    46
            add     rax,rdx
            add     rdx,rax
            endm
            add     rax,rdx
            ret
    fib     endp

Test code that runs 1 million (1048576) loops calculating fib(0) through fib(93) using multiples of 37 mod 93, so the order is not sequential. On my system, the loop version took about 1.44 seconds and the indexed-branch version about 1.04 seconds.

    #include <stdio.h>
    #include <time.h>

    typedef unsigned int uint32_t;
    typedef unsigned long long uint64_t;

    extern "C" uint64_t fib(uint64_t);

    /* multiples of 37 mod 93 + 93 at end */
    static uint64_t a[94] =
         {0,37,74,18,55,92,36,73,17,54,
          91,35,72,16,53,90,34,71,15,52,
          89,33,70,14,51,88,32,69,13,50,
          87,31,68,12,49,86,30,67,11,48,
          85,29,66,10,47,84,28,65, 9,46,
          83,27,64, 8,45,82,26,63, 7,44,
          81,25,62, 6,43,80,24,61, 5,42,
          79,23,60, 4,41,78,22,59, 3,40,
          77,21,58, 2,39,76,20,57, 1,38,
          75,19,56,93};

    /* x used to avoid compiler optimizing out result of fib() */
    int main()
    {
        size_t i, j;
        clock_t cbeg, cend;
        uint64_t x = 0;
        cbeg = clock();
        for(j = 0; j < 0x100000; j++)
            for(i = 0; i < 94; i++)
                x += fib(a[i]);
        cend = clock();
        printf("%llx\n", x);
        printf("# ticks = %u\n", (uint32_t)(cend-cbeg));
        return 0;
    }

The output for x is 0x812a62b1dc000000. The sum of fib(0) through fib(93) in hex is 0x1bb433812a62b1dc0; appending five more hex zeros for the 0x100000 repetitions gives 0x1bb433812a62b1dc000000. The top 6 hex digits are truncated by the 64-bit math.

I made an all-assembly version to better control code placement. Change "if 1" to "if 0" to select the loop version. The loop version takes from 1.465 to 2.000 seconds depending on the nop padding used to place key locations on even or odd 16-byte boundaries (see comments below). The computed-jump version takes about 1.04 seconds, and boundaries make less than 1% difference in its time.

            includelib msvcrtd
            includelib oldnames

            .data
    ; multiples of 37 mod 93 + 93 at the end
    a       dq      0,37,74,18,55,92,36,73,17,54
            dq      91,35,72,16,53,90,34,71,15,52
            dq      89,33,70,14,51,88,32,69,13,50
            dq      87,31,68,12,49,86,30,67,11,48
            dq      85,29,66,10,47,84,28,65, 9,46
            dq      83,27,64, 8,45,82,26,63, 7,44
            dq      81,25,62, 6,43,80,24,61, 5,42
            dq      79,23,60, 4,41,78,22,59, 3,40
            dq      77,21,58, 2,39,76,20,57, 1,38
            dq      75,19,56,93
            .data?
            .code
    ; parameters rcx,rdx,r8,r9
    ; not saved rax,rcx,rdx,r8,r9,r10,r11
    ; code starts on 16 byte boundary
    main    proc
            push    r15
            push    r14
            push    r13
            push    r12
            push    rbp
            mov     rbp,rsp
            and     rsp,0fffffffffffffff0h
            sub     rsp,64
            mov     r15,offset a
            xor     r14,r14
            mov     r11,0100000h
    ; nop padding effect on loop version (with 0 padding in padx below)
    ; 0 puts main2 on odd 16 byte boundary clk = 0131876622h => 1.465 seconds
    ; 9 puts main1 on odd 16 byte boundary clk = 01573FE951h => 1.645 seconds
            rept    0
            nop
            endm
            rdtsc
            mov     r12,rdx
            shl     r12,32
            or      r12,rax
    main0:  xor     r10,r10
    main1:  mov     rcx,[r10+r15]
            call    fib
    main2:  add     r14,rax
            add     r10,8
            cmp     r10,8*94
            jne     main1
            dec     r11
            jnz     main0
            rdtsc
            mov     r13,rdx
            shl     r13,32
            or      r13,rax
            sub     r13,r12
            mov     rdx,r14
            xor     rax,rax
            mov     rsp,rbp
            pop     rbp
            pop     r12
            pop     r13
            pop     r14
            pop     r15
            ret
    main    endp

            align   16
    padx    proc
    ; nop padding effect on loop version with 0 padding above
    ; 0 puts fib on odd 16 byte boundary clk = 0131876622h => 1.465 seconds
    ; 16 puts fib on even 16 byte boundary clk = 01A13C8CB8h => 2.000 seconds
    ; nop padding effect on computed jump version with 9 padding above
    ; 0 puts fib on odd 16 byte boundary clk = 00D979792Dh => 1.042 seconds
    ; 16 puts fib on even 16 byte boundary clk = 00DA93E04Dh => 1.048 seconds
            rept    0
            nop
            endm
    padx    endp

            if      1       ;0 = loop version, 1 = computed jump version
    fib     proc                    ;rcx == n
            mov     r8,rcx          ;set jmp adr
            mov     r9,offset fib0+279
            lea     r8,[r8+r8*2]
            neg     r8
            add     r8,r9
            mov     rax,rcx         ;set rax,rdx
            mov     rdx,1
            and     rax,rdx
            sub     rdx,rax
            jmp     r8
    fib0:                           ; assumes add xxx,xxx takes 3 bytes
            rept    46
            add     rax,rdx
            add     rdx,rax
            endm
            add     rax,rdx
            ret
    fib     endp
            else
    fib     proc                    ;rcx == n
            mov     rax,rcx         ;br if < 2
            cmp     rax,2
            jb      fib1
            mov     rdx,1           ;set rax, rdx
            and     rax,rdx
            sub     rdx,rax
            shr     rcx,1
    fib0:   add     rdx,rax
            add     rax,rdx
            dec     rcx
            jnz     fib0
    fib1:   ret
    fib     endp
            endif
            end
1 answer

This was the answer to the original question: why the loop takes 1.4x the time of the computed-jump version, when the result is barely used. IDK exactly why accumulating the result with a 1-cycle loop-carried dependency chain would matter so much. Interesting things to try: store it to memory (e.g. assign it to a volatile int discard) so the asm dep chain doesn't just end with a dead register. Hardware might conceivably optimize that (e.g. discard the uops once it's sure the result is dead). Intel says the Sandybridge family can do this for one of the flag-result uops of shl reg,cl.


Old answer: Why is the computed jump 1.4x faster than the loop when the result is not used

Here you're testing throughput, not latency. In our previous discussion, I was mostly focusing on latency. That may have been a mistake; the throughput impact on the caller can often be as relevant as latency, depending on how much of what the caller does afterward has a data dependency on the result.

Out-of-order execution hides the latency because the result of one call is not an input dependency for the argument of the next call. And IvyBridge's out-of-order window is large enough to be useful here: a 168-entry ROB (from issue to retirement), a 54-entry scheduler (from issue to execution), and a 160-entry physical register file. See also PRF vs. ROB limits for the out-of-order window size.

Out-of-order execution also hides the cost of the branch mispredict before fib completes. Work from the last fib(n) dep chain is still in flight and being worked on during that mispredict. (Modern Intel processors only roll back to the mispredicted branch, and can keep executing independent uops from before the branch while the mispredict is resolved.)

It makes sense that the computed-branch version does well here, because you're mostly bottlenecked on uop throughput, and a mispredict of the loop-exit branch costs about the same as a mispredict of the indirect branch on entry to the unrolled version. IvB can macro-fuse sub/jcc into a single uop for port 5, so the 40% number matches up pretty well. (3 ALU execution ports, so spending 1/3 of the ALU execution throughput on loop overhead explains it. Differences in mispredicts and out-of-order execution limits explain the rest.)


I think in most real-world use cases, latency may be relevant. Maybe throughput will still be the most important thing, but anything other than this benchmark will make latency matter more, because this doesn't even use the result at all. Of course it's normal that earlier independent work can proceed while the mispredict on the indirect branch is recovered, but it delays when the result is ready, which could mean a stall later if most of the instructions after fib() returns are dependent on the result. But if they aren't (e.g. lots of reloads and address calculations for where to put the result), then having the front-end start issuing uops after fib() sooner is a good thing.

I think a good middle ground here would be unrolling by 4 or 8, with a check before the unrolled loop to make sure it should run at least once (e.g. sub rcx,8 / jb .cleanup).


Also note that your loop version has a dependency on n for the initial values. In our previous discussion, I pointed out that avoiding this would be better for out-of-order execution, since it lets the add chain get started before n is ready. I don't think that's a big factor here, because the caller has low latency for n. But it does put the loop-exit branch mispredict at the end of the n -> fib(n) dep chain instead of in the middle. (I'm picturing a branchless lea / cmov after the loop to do one more iteration if sub ecx,2 goes below zero instead of to zero.)

