This is a continuation of some comments made in this previous thread:
Fibonacci Recursive Assembly
The following code fragments compute Fibonacci, the first example with a loop, the second example with a calculated jump (indexed branch) into an expanded loop. This was tested using Visual Studio 2015 Desktop Express in 64-bit mode of Windows 7 Pro with an Intel 3770K 3.5ghz processor. With one test cycle, fib (0) via fib (93), the best time I get for the version of the cycle is ~ 1.901 microseconds, and for the calculated jump ~ 1.324 microseconds. Using an external loop to repeat this process 1,048,576 times, the loop version takes about 1.44 seconds, the calculated jump takes about 1.04 seconds. In both test suites, the loop version is approximately 40% slower than the calculated transition version.
Question: Why is the loop version more sensitive to the location of the code than the calculated transition version? In previous tests, some combinations of the code location caused the cycle version time to increase from about 1.44 seconds to 1.93 seconds, but I never found a combination that significantly affected the calculation time of the calculated transition time.
Partial answer: the computed version of the branch branches into 94 possible target locations within 280 bytes, and apparently the target branch buffer (cache) does a good job of optimizing this. For the loop version, use align 16 to set the fib () function based on the assembly on a 16-byte boundary. In most cases, the loop version time problem was solved, but some changes in main () still affected the time. I need to find a fairly small and repeatable test case.
version of the loop (note I read that | dec | jnz | is faster than | loop |):
align 16 fib proc ;rcx == n mov rax,rcx ;br if < 2 cmp rax,2 jb fib1 mov rdx,1 ;set rax, rdx and rax,rdx sub rdx,rax shr rcx,1 fib0: add rdx,rax add rax,rdx dec rcx jnz fib0 fib1: ret fib endp
calculated transition (indexed branch) to the expanded cycle version:
align 16 fib proc ;rcx == n mov r8,rcx ;set jmp adr mov r9,offset fib0+279 lea r8,[r8+r8*2] neg r8 add r8,r9 mov rax,rcx ;set rax,rdx mov rdx,1 and rax,rdx sub rdx,rax jmp r8 fib0: ; assumes add xxx,xxx takes 3 bytes rept 46 add rax,rdx add rdx,rax endm add rax,rdx ret fib endp
Test code that runs 1 million (1048576) cycles to calculate fib(0) to fib(93) using a multiple of 37% 93, so the order is not sequential. On my system, the loop version took about 1.44 seconds, and the indexed version of the branch took about 1.04 seconds.
#include <stdio.h> #include <time.h> typedef unsigned int uint32_t; typedef unsigned long long uint64_t; extern "C" uint64_t fib(uint64_t); /* multiples of 37 mod 93 + 93 at end */ static uint64_t a[94] = {0,37,74,18,55,92,36,73,17,54, 91,35,72,16,53,90,34,71,15,52, 89,33,70,14,51,88,32,69,13,50, 87,31,68,12,49,86,30,67,11,48, 85,29,66,10,47,84,28,65, 9,46, 83,27,64, 8,45,82,26,63, 7,44, 81,25,62, 6,43,80,24,61, 5,42, 79,23,60, 4,41,78,22,59, 3,40, 77,21,58, 2,39,76,20,57, 1,38, 75,19,56,93}; /* x used to avoid compiler optimizing out result of fib() */ int main() { size_t i, j; clock_t cbeg, cend; uint64_t x = 0; cbeg = clock(); for(j = 0; j < 0x100000; j++) for(i = 0; i < 94; i++) x += fib(a[i]); cend = clock(); printf("%llx\n", x); printf("# ticks = %u\n", (uint32_t)(cend-cbeg)); return 0; }
The output for x is 0x812a62b1dc000000. The sum of fib (0) on fib (93) in hex is 0x1bb433812a62b1dc0 and adds another 5 zeros for a 0x100000 cycle: 0x1bb433812a62b1dc000000. The top 6 pieces are truncated due to 64-bit math.
I made the entire version of the assembly to better control the location of the code. "If 1" is replaced by "if 0" for the loop version. The version of the loop takes from 1.465 to 2.000 seconds depending on the nop padding used to place key positions at even or odd 16 byte boundaries (see comments below). The calculated version of the transition takes about 1.04 seconds, and the boundaries are less than 1% of the time difference.
includelib msvcrtd includelib oldnames .data ; multiples of 37 mod 93 + 93 at the end a dq 0,37,74,18,55,92,36,73,17,54 dq 91,35,72,16,53,90,34,71,15,52 dq 89,33,70,14,51,88,32,69,13,50 dq 87,31,68,12,49,86,30,67,11,48 dq 85,29,66,10,47,84,28,65, 9,46 dq 83,27,64, 8,45,82,26,63, 7,44 dq 81,25,62, 6,43,80,24,61, 5,42 dq 79,23,60, 4,41,78,22,59, 3,40 dq 77,21,58, 2,39,76,20,57, 1,38 dq 75,19,56,93 .data? .code ; parameters rcx,rdx,r8,r9 ; not saved rax,rcx,rdx,r8,r9,r10,r11 ; code starts on 16 byte boundary main proc push r15 push r14 push r13 push r12 push rbp mov rbp,rsp and rsp,0fffffffffffffff0h sub rsp,64 mov r15,offset a xor r14,r14 mov r11,0100000h ; nop padding effect on loop version (with 0 padding in padx below) ; 0 puts main2 on odd 16 byte boundary clk = 0131876622h => 1.465 seconds ; 9 puts main1 on odd 16 byte boundary clk = 01573FE951h => 1.645 seconds rept 0 nop endm rdtsc mov r12,rdx shl r12,32 or r12,rax main0: xor r10,r10 main1: mov rcx,[r10+r15] call fib main2: add r14,rax add r10,8 cmp r10,8*94 jne main1 dec r11 jnz main0 rdtsc mov r13,rdx shl r13,32 or r13,rax sub r13,r12 mov rdx,r14 xor rax,rax mov rsp,rbp pop rbp pop r12 pop r13 pop r14 pop r15 ret main endp align 16 padx proc ; nop padding effect on loop version with 0 padding above ; 0 puts fib on odd 16 byte boundary clk = 0131876622h => 1.465 seconds ; 16 puts fib on even 16 byte boundary clk = 01A13C8CB8h => 2.000 seconds ; nop padding effect on computed jump version with 9 padding above ; 0 puts fib on odd 16 byte boundary clk = 00D979792Dh => 1.042 seconds ; 16 puts fib on even 16 byte boundary clk = 00DA93E04Dh => 1.048 seconds rept 0 nop endm padx endp if 1 ;0 = loop version, 1 = computed jump version fib proc ;rcx == n mov r8,rcx ;set jmp adr mov r9,offset fib0+279 lea r8,[r8+r8*2] neg r8 add r8,r9 mov rax,rcx ;set rax,rdx mov rdx,1 and rax,rdx sub rdx,rax jmp r8 fib0: ; assumes add xxx,xxx takes 3 bytes rept 46 add rax,rdx add rdx,rax endm add rax,rdx ret fib endp else fib proc ;rcx == n mov rax,rcx ;br if < 2 cmp rax,2 jb fib1 mov rdx,1 ;set rax, rdx and rax,rdx sub rdx,rax shr rcx,1 fib0: add rdx,rax add rax,rdx dec rcx jnz fib0 fib1: ret fib endp endif end