Stack allocation and performance

While investigating some performance issues, I noticed an interesting effect of stack allocation. Here is the template I use for measuring time:

    #include <chrono>
    #include <iostream>

    using namespace std;
    using namespace std::chrono;

    int x; // for simple optimization suppression
    void foo();

    int main()
    {
        const size_t n = 10000000; // ten million
        auto start = high_resolution_clock::now();
        for (size_t i = 0; i < n; i++)
        {
            foo();
        }
        auto finish = high_resolution_clock::now();
        cout << duration_cast<milliseconds>(finish - start).count() << endl;
    }

Now for the implementations of foo(). Each implementation allocates a total of 500000 ints:

  • Allocated in one chunk:

        void foo()
        {
            const int size = 500000;
            int a1[size];
            x = a1[size - 1];
        }

    Result: 7.3 seconds;

  • Allocated in two chunks:

        void foo()
        {
            const int size = 250000;
            int a1[size];
            int a2[size];
            x = a1[size - 1] + a2[size - 1];
        }

    Result: 3.5 seconds;

  • Allocated in four chunks:

        void foo()
        {
            const int size = 125000;
            int a1[size];
            int a2[size];
            int a3[size];
            int a4[size];
            x = a1[size - 1] + a2[size - 1] + a3[size - 1] + a4[size - 1];
        }

    Result: 1.8 seconds.

And so on. When I divided the allocation into 16 pieces, the time came down to 0.38 seconds.


Please explain to me why and how this happens.
I used MSVC 2013 (v120), release build.

UPD:
My machine is x64, but I compiled the code for the Win32 platform.
When I compile it for the x64 platform, every version takes about 40 ms.
Why does the choice of target platform have such a large effect?

c++ performance c stack allocation
2 answers

Looking at the disassembly from VS2015 Update 3: in the 2- and 4-array versions of foo, the compiler optimizes out the unused arrays and reserves stack space for only one array in each function. Since the later functions have smaller arrays, that takes less time. The assignment to x reads the same memory location in both / all 4 cases. (Since the arrays are uninitialized, reading from them is undefined behavior.) Without optimization, there are 2 or 4 distinct arrays that are read from.

The large amount of time spent in these functions is due to the stack probing performed by __chkstk as part of the function prologue (necessary whenever the compiler needs more than one page of stack space for the local variables).
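The collapse into a single array only happens because the arrays are never written. A minimal sketch of a variant that defeats it (foo2_touched and its memset calls are my own illustration, not code from the question or answer): touching every byte forces the compiler to materialize both arrays, so the full stack frame, and the __chkstk probing it triggers, come back.

```cpp
#include <cassert>
#include <cstring>

int x; // global sink, mirroring the question's code

// Hypothetical variant of the two-array foo: memset touches every byte of
// both arrays, so the optimizer can no longer fold them into one unused
// allocation and must reserve space for both on the stack.
void foo2_touched() {
    const int size = 250000;
    int a1[size];
    int a2[size];
    std::memset(a1, 1, sizeof a1); // every int becomes 0x01010101
    std::memset(a2, 2, sizeof a2); // every int becomes 0x02020202
    x = a1[size - 1] + a2[size - 1];
}
```

With this variant the timings of the split versions should become comparable again, since all of them genuinely use their full stack space.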


You should look at the assembler code to see what your compiler really does with this code. For gcc/clang/icc you can use Matt Godbolt's Compiler Explorer.

clang optimizes everything away because of the UB; the result (foo is the first version, foo2 the second) is:

    foo:                                    # @foo
            retq
    foo2:                                   # @foo2
            retq

icc treats both versions very similarly:

    foo:
            pushq     %rbp                    #4.1
            movq      %rsp, %rbp              #4.1
            subq      $2000000, %rsp          #4.1
            movl      -4(%rbp), %eax          #8.9
            movl      %eax, x(%rip)           #8.5
            leave                             #10.1
            ret                               #10.1
    foo2:
            pushq     %rbp                    #13.1
            movq      %rsp, %rbp              #13.1
            subq      $2000000, %rsp          #13.1
            movl      -1000004(%rbp), %eax    #18.9
            addl      -4(%rbp), %eax          #18.24
            movl      %eax, x(%rip)           #18.5
            leave                             #19.1
            ret

and gcc creates different assembler code for the different versions. Version 6.1 produces code that shows behavior similar to your experiments:

    foo:
            pushq   %rbp
            movq    %rsp, %rbp
            subq    $2000016, %rsp
            movl    1999996(%rsp), %eax
            movl    %eax, x(%rip)
            leave
            ret
    foo2:
            pushq   %rbp
            movl    $1000016, %edx    # only the first array is allocated
            movq    %rsp, %rbp
            subq    %rdx, %rsp
            leaq    3(%rsp), %rax
            subq    %rdx, %rsp
            shrq    $2, %rax
            movl    999996(,%rax,4), %eax
            addl    999996(%rsp), %eax
            movl    %eax, x(%rip)
            leave
            ret

So the only way to understand the difference is to look at the assembler code generated by your compiler; everything else is just guesswork.

