When starting a slightly modified program (located below) these are the numbers I got:
x64 Release, .NET 4.6.1
TestRun: Total: 1000000000, Outer: 100, Inner: 10000000
add_k_array, elapsed 1296 ms, accumulated result 495000099L
add_k_list, elapsed 2675 ms, accumulated result 495000099L
add_k_list_mutable, elapsed 2678 ms, accumulated result 495000099L

TestRun: Total: 1000000000, Outer: 1000, Inner: 1000000
add_k_array, elapsed 869 ms, accumulated result 499624318L
add_k_list, elapsed 2486 ms, accumulated result 499624318L
add_k_list_mutable, elapsed 2483 ms, accumulated result 499624318L

TestRun: Total: 1000000000, Outer: 10000, Inner: 100000
add_k_array, elapsed 750 ms, accumulated result 507000943L
add_k_list, elapsed 1602 ms, accumulated result 507000943L
add_k_list_mutable, elapsed 1603 ms, accumulated result 507000943L
x86 Release, .NET 4.6.1
TestRun: Total: 1000000000, Outer: 100, Inner: 10000000
add_k_array, elapsed 1601 ms, accumulated result 495000099L
add_k_list, elapsed 2014 ms, accumulated result 495000099L
add_k_list_mutable, elapsed 1835 ms, accumulated result 495000099L

TestRun: Total: 1000000000, Outer: 1000, Inner: 1000000
add_k_array, elapsed 1495 ms, accumulated result 499624318L
add_k_list, elapsed 1714 ms, accumulated result 499624318L
add_k_list_mutable, elapsed 1595 ms, accumulated result 499624318L

TestRun: Total: 1000000000, Outer: 10000, Inner: 100000
add_k_array, elapsed 1363 ms, accumulated result 507000943L
add_k_list, elapsed 1406 ms, accumulated result 507000943L
add_k_list_mutable, elapsed 1221 ms, accumulated result 507000943L
(As usual, it is important not to run with a debugger attached, as this changes how the JITer works. With a debugger attached, the JITer produces code that is easier to debug, but also slower.)
The way it works is that the total number of iterations is kept constant, while the outer loop count and the size of the list/array vary.
For me, the only odd measurement is that in some cases the array loop is worse than the list loop.
If the total amount of work is the same, why do we see different results when the outer/inner split changes?
The answer is most likely related to the processor cache. When we iterate over an array of 10,000,000 elements, its actual size in memory is 40,000,000 bytes. My machine has "only" 6,000,000 bytes of L3 cache. When the array has 1,000,000 elements, its size is 4,000,000 bytes, which fits in L3.
The list type in F# is essentially a singly linked list, and a rough estimate of a list element is 4 (data) + 8 (64-bit next pointer) + 8 (vtable pointer) + 4 (heap overhead) = 24 bytes. With this estimate, a list with 10,000,000 elements occupies 240,000,000 bytes and a list with 1,000,000 elements 24,000,000 bytes. Neither fits in the L3 cache on my machine.
When the number of elements is 100,000, the array takes 400,000 bytes and the list 2,400,000 bytes. Both fit comfortably in the L3 cache.
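The footprint estimates above can be sketched as a quick back-of-the-envelope calculation (the 24 bytes per cons cell is of course only an estimate, not an exact figure):

```fsharp
// Rough memory footprint of an int array vs. an F# int list (x64).
let arrayBytes (n : int) = 4L * int64 n   // 4 bytes per int element
let listBytes  (n : int) = 24L * int64 n  // ~24 bytes per cons cell (estimate)

for n in [10000000; 1000000; 100000] do
  printfn "n = %8d: array ~%d bytes, list ~%d bytes" n (arrayBytes n) (listBytes n)
```

For a 6,000,000-byte L3 cache, only the 1,000,000-element array and the 100,000-element array/list fit; the 10,000,000-element structures do not.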
This reasoning may explain the performance difference between the smaller and the larger arrays/lists.
If the list elements are not allocated sequentially (for example because the heap is fragmented or the GC has moved them), list performance is expected to be much worse once the list no longer fits in cache, since the processor's prefetching strategy no longer works. Array elements are guaranteed to be contiguous, so prefetching works well when iterating sequentially.
Why is tail recursion slower than the loop with a mutable variable?
This actually wasn't true in F# 3, where the for loop was expected to be much slower than tail recursion.
As a hint towards an answer, I used ILSpy to look at the generated IL code.
I found that FSharpList<>::get_TailOrNull() is called twice per iteration when using tail recursion: once to check whether we have reached the end, and once to get the next element (a redundant call).
The for-loop version only calls FSharpList<>::get_TailOrNull() once per iteration.
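To make the difference concrete, here is roughly what the two loop shapes look like at the source level (a sketch with hypothetical function names; the get_TailOrNull() calls mentioned above are compiler-generated internals, placed as indicated by the IL inspected above):

```fsharp
// Tail-recursive loop: the compiled pattern match reads the tail once to
// test for [] and once more to bind xs -- two get_TailOrNull() calls.
let rec sumRec (l : int list) (acc : int64) =
  match l with
  | []    -> acc
  | x::xs -> sumRec xs (acc + int64 x)

// for-in loop over a list: one tail read per step.
let sumFor (l : int list) =
  let mutable acc = 0L
  for x in l do
    acc <- acc + int64 x
  acc
```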
This extra call probably explains why tail recursion is slower, but as you noticed, in 64-bit mode both list versions were equally fast. What's going on?
I checked the JITed assembly code and noticed that the x64 JITer eliminated the extra call to FSharpList<>::get_TailOrNull(). The x86 JITer failed to eliminate the call.
Finally, why is the array version slower than the list version on x86?
In general, I expect arrays to have the lowest overhead of any collection in .NET. The reason is that they are compact and sequential, and there are special instructions in ILAsm for accessing their elements.
Yet here it seems that in some cases lists perform better.
Checking the assembly code once more, it appears that the array version requires an extra variable to do its work, and the x86 CPU has few registers available, leading to an extra read from the stack on each iteration. x64 has significantly more registers, so the array version only has to read from memory once per iteration, whereas the list version reads twice (head and tail).
Any conclusions?
- x64 is the way to go when it comes to performance (this hasn't always been the case)
- Expect arrays to perform better than any other data structure in .NET for operations where array access is O(1) (inserts are slow, obviously)
- The devil is in the details; to get real insight we may need to check the assembly code
- Cache locality is very important for large collections. Because arrays are compact and guaranteed to be contiguous, they are often a good choice
- It's very hard to predict performance; always measure
- Iterate towards zero when possible if you're really hungry for performance. This may save one read from memory
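The "iterate towards zero" advice refers to loops shaped like this (a sketch; the idea is that the loop condition compares against the constant 0 instead of a length value that may have to be re-read):

```fsharp
// Counting down: the termination test is "i >= 0", a comparison against
// an immediate constant, which may save one read per iteration.
let sumDown (vs : int []) =
  let mutable acc = 0L
  for i = vs.Length - 1 downto 0 do
    acc <- acc + int64 vs.[i]
  acc
```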
EDIT: OP wondered why x86 lists seemed to perform better than x64 lists
I reran the performance tests with a list/array size of 1,000. This makes sure the entire data structure fits into my L1 cache (256 KB).
x64 Release, .NET 4.6.1
TestRun: Total: 1000000000, Outer: 1000000, Inner: 1000
add_k_array, elapsed 1062 ms, accumulated result 999499999L
add_k_list, elapsed 1134 ms, accumulated result 999499999L
add_k_list_mutable, elapsed 1110 ms, accumulated result 999499999L
x86 Release, .NET 4.6.1
TestRun: Total: 1000000000, Outer: 1000000, Inner: 1000
add_k_array, elapsed 1617 ms, accumulated result 999499999L
add_k_list, elapsed 1359 ms, accumulated result 999499999L
add_k_list_mutable, elapsed 1100 ms, accumulated result 999499999L
We see that for this size, x64 performs about as well as or better than x86. Why do we see the opposite in the other measurements? I am guessing this is because list elements are bigger in the x64 version, meaning we use more bandwidth moving data from L3 to L1.
So, if you can, try to make sure your data fits into the L1 cache.
Final thoughts
When dealing with questions like these, I sometimes wonder if the whole von Neumann architecture was a big mistake. Perhaps we should have a data-flow architecture instead, since data is slow and instructions are fast.
AFAIK, under the hood CPUs do have a data-flow architecture. The assembly language, however, looks the way one would expect from a von Neumann architecture, so in a sense it is a high-level abstraction over the data-flow architecture. But in order to produce reasonably performing code, the CPU die is largely occupied by cache (~95%). With a pure data-flow architecture, one would expect a higher percentage of the die to be doing actual work.
Hope this was interesting; my modified program:
open System.Diagnostics

let stopWatch =
  let sw = Stopwatch ()
  sw.Start ()
  sw

let timeIt (name : string) (outer : int) (a : int -> int64) : unit =
  let t = stopWatch.ElapsedMilliseconds
  let mutable acc = a 0
  for i = 2 to outer do
    acc <- acc + a i
  let d = stopWatch.ElapsedMilliseconds - t
  printfn "%s, elapsed %d ms, accumulated result %A" name d acc

let add_k_list x l (k_range : int list) =
  let rec add k_range x acc =
    match k_range with
    | []    -> acc
    | k::ks ->
      let y = x ^^^ k
      if (y < k || y > l) then
        add ks x (acc + 1L)
      else
        add ks x acc
  add k_range x 0L

let add_k_list_mutable x l (k_range : int list) =
  let mutable count = 0L
  for k in k_range do
    let y = x ^^^ k
    if (y < k || y > l) then
      count <- count + 1L
  count

let add_k_array x l (k_range : int []) =
  let mutable count = 0L
  for k in k_range do
    let y = x ^^^ k
    if (y < k || y > l) then
      count <- count + 1L
  count

[<EntryPoint>]
let main argv =
  let total  = 1000000000
  let outers = [|100; 1000; 10000|]
  for outer in outers do
    let inner = total / outer
    printfn "TestRun: Total: %d, Outer: %d, Inner: %d" total outer inner
    ignore <| System.GC.WaitForFullGCComplete ()
    let testList  = [1..inner]
    let testArray = [|1..inner|]
    timeIt "add_k_array" outer <| fun x -> add_k_array x inner testArray
    timeIt "add_k_list" outer <| fun x -> add_k_list x inner testList
    timeIt "add_k_list_mutable" outer <| fun x -> add_k_list_mutable x inner testList
  0