Visual C++ optimization options - how to improve code output?

Are there any options (other than /O2) that improve Visual C++ code output? The MSDN documentation is pretty bad in this regard. Note that I am not asking about general project settings (link-time optimization, etc.); I'm only interested in this specific example.

The simple C++11 code is as follows:

    #include <vector>

    int main() {
        std::vector<int> v = {1, 2, 3, 4};
        int sum = 0;
        for (int i = 0; i < v.size(); i++) {
            sum += v[i];
        }
        return sum;
    }

Clang's output with libc++ is quite compact:

    main:                           # @main
        mov     eax, 10
        ret

Visual C++'s output, on the other hand, is a multi-page mess. Am I missing something here, or is VS really that bad?

Compiler Explorer Link: https://godbolt.org/g/GJYHjE

c++ c++11 visual-c++ cl
1 answer

Unfortunately, I could not find a way to significantly improve the Visual C++ output in this case, even with more aggressive optimization flags. Several factors seem to contribute to the inefficiency of the VS output, including the lack of certain compiler optimizations and the structure of Microsoft's implementation of <vector> .

Inspecting the generated assembly shows that Clang does an objectively excellent job of optimizing this code. Specifically, compared to VS, Clang performs very effective constant propagation, function inlining (and consequently dead-code elimination), and new/delete optimization.

Constant propagation

In the example, the vector is statically initialized:

 std::vector<int> v = {1, 2, 3, 4}; 

Typically, the compiler stores the constants 1, 2, 3, 4 in data memory, and in the for loop loads them one at a time, starting at the lowest address (which holds 1), adding each value to the sum.

Here is an abbreviated version of the VS code for this:

    movdqa  xmm0, XMMWORD PTR __xmm@00000004000000030000000200000001
    ...
    movdqu  XMMWORD PTR $T1[rsp], xmm0   ; store the integers 1, 2, 3, 4 in memory
    ...
    $LL4@main:
    add     ebx, DWORD PTR [rdx]         ; loop and sum the values
    lea     rdx, QWORD PTR [rdx+4]
    inc     r8d
    movsxd  rax, r8d
    cmp     rax, r9
    jb      SHORT $LL4@main

Clang, however, is smart enough to realize that the sum can be computed in advance. My best guess is that it replaces the loads of the constants from memory with constant mov operations into registers (propagating the constants), and then folds them into the result 10. This has the useful side effect of breaking dependencies, and since the addresses are no longer loaded, the compiler can remove everything else as dead code.

Clang appears to be unique here: neither VS nor GCC was able to precompute the result of the vector accumulation in advance.
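As a rough sketch (my own illustration, not Clang's actual internal transformation), the effect is similar to what a C++14 constexpr function makes explicit: the whole accumulation is evaluated at compile time, and only the final constant survives.

```cpp
// Hypothetical illustration: constexpr forces at compile time what
// Clang achieves here purely through optimization.
constexpr int sum_1_to_4() {
    int sum = 0;
    for (int i = 1; i <= 4; ++i) {
        sum += i;  // constants are propagated, no memory loads remain
    }
    return sum;    // folds to the constant 10; no loop is emitted
}

// Proof that the value is known at compile time:
static_assert(sum_1_to_4() == 10, "computed entirely at compile time");
```

Of course, the point of the original example is that Clang performs this folding without any constexpr annotation, purely as an optimization.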

New/delete optimization

C++14-conforming compilers may omit calls to new and delete under certain conditions, specifically when the number of allocation calls is not part of the observable behavior of the program (standard paper N3664). This has already generated a lot of discussion on SO:

  • clang vs gcc - optimization including operator new
  • Is the compiler allowed to optimize out heap memory allocations?
  • Optimization of raw new[]/delete[] vs std::vector

Clang, invoked with -std=c++14 -stdlib=libc++ , indeed performs this optimization and eliminates the calls to new and delete, which do have side effects but do not affect the observable behavior of the program. With -stdlib=libstdc++ , Clang is more conservative and keeps the calls to new and delete, even though a look at the assembly makes it clear they are not really needed.
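A minimal sketch of the kind of code N3664 covers (the function name is my own): the allocation itself is not observable, so a conforming compiler is allowed to remove the new/delete pair entirely and return the constant directly.

```cpp
int alloc_then_free() {
    // The number of allocation calls is not observable behavior, so
    // under N3664 the compiler may elide this new/delete pair entirely.
    int* p = new int(42);
    int result = *p;
    delete p;
    return result;  // clang -O2 -std=c++14 typically reduces this
                    // whole function to `mov eax, 42; ret`
}
```

Visual C++, as shown below, keeps the allocator calls even in cases like this.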

Now, inspecting the code VS generates for main , we find two function calls there (the rest of the vector construction and iteration code is inlined into main ):

 call std::vector<int,std::allocator<int> >::_Range_construct_or_tidy<int const * __ptr64> 

and

 call void __cdecl operator delete(void * __ptr64) 

The first allocates the vector and the second frees it, and almost all the other functions in the VS output are pulled in by these two calls. This implies that Visual C++ does not optimize away allocation calls (for C++14 conformance we must add the /std:c++14 flag, but the results are the same).

This blog post (May 10, 2017) from the Visual C++ team confirms that this optimization is not implemented. Searching that page for N3664 shows that "Avoiding / fusing allocations" has status N/A, and the accompanying comment reads:

[A]voiding / fusing allocations is permitted but not required. For now, we've chosen not to implement this.

Combining the new/delete optimization with constant propagation, it is easy to see the impact of these two optimizations in this Compiler Explorer three-way comparison of Clang with -stdlib=libc++ , Clang with -stdlib=libstdc++ , and GCC.

STL implementation

VS ships its own STL implementation, which is quite different from libc++ and libstdc++, and this seems to be a major contributor to the VS code generation. While the VS STL has some very useful features, such as checked iterators and iterator debugging facilities ( _ITERATOR_DEBUG_LEVEL ), my general impression is that it is more heavily layered and less efficient than libstdc++.

To isolate the impact of the vector STL implementation, an interesting experiment is to compile with Clang combined with the VS header files. Indeed, Clang 5.0.0 with the Visual Studio 2015 headers produces the following code generation: clearly, the STL implementation has a huge impact!

    main:                                   # @main
    .Lfunc_begin0:
    .Lcfi0:
        .seh_proc main
        .seh_handler __CxxFrameHandler3, @unwind, @except
    # BB#0:                                 # %.lr.ph
        pushq   %rbp
    .Lcfi1:
        .seh_pushreg 5
        pushq   %rsi
    .Lcfi2:
        .seh_pushreg 6
        pushq   %rdi
    .Lcfi3:
        .seh_pushreg 7
        pushq   %rbx
    .Lcfi4:
        .seh_pushreg 3
        subq    $72, %rsp
    .Lcfi5:
        .seh_stackalloc 72
        leaq    64(%rsp), %rbp
    .Lcfi6:
        .seh_setframe 5, 64
    .Lcfi7:
        .seh_endprologue
        movq    $-2, (%rbp)
        movl    $16, %ecx
        callq   "??2@YAPEAX_K@Z"
        movq    %rax, -24(%rbp)
        leaq    16(%rax), %rcx
        movq    %rcx, -8(%rbp)
        movups  .L.ref.tmp(%rip), %xmm0
        movups  %xmm0, (%rax)
        movq    %rcx, -16(%rbp)
        movl    4(%rax), %ebx
        movl    8(%rax), %esi
        movl    12(%rax), %edi
    .Ltmp0:
        leaq    -24(%rbp), %rcx
        callq   "?_Tidy@?$vector@HV?$allocator@H@std@@@std@@IEAAXXZ"
    .Ltmp1:
    # BB#1:                                 # %"\01??1?$vector@HV?$allocator@H@std@@@std@@QEAA@XZ.exit"
        addl    %ebx, %esi
        leal    1(%rdi,%rsi), %eax
        addq    $72, %rsp
        popq    %rbx
        popq    %rdi
        popq    %rsi
        popq    %rbp
        retq
        .seh_handlerdata
        .long   ($cppxdata$main)@IMGREL
        .text

Update - Visual Studio 2017

In Visual Studio 2017, <vector> received a major overhaul, as announced in this blog post from the Visual C++ team. In particular, it mentions the following optimizations:

  • Eliminated unnecessary EH logic. For example, vector's copy assignment operator had an unnecessary try-catch block. It just has to provide the basic guarantee, which can be achieved through proper action sequencing.

  • Improved performance by avoiding unnecessary rotate() calls. For example, emplace(where, val) was calling emplace_back() followed by rotate(). Now vector calls rotate() in only one scenario (range insertion with input-only iterators, as described above).

  • Improved performance with stateful allocators. For example, move construction with non-equal allocators now attempts to activate the memmove() optimization. (Previously, make_move_iterator() was used, which had the side effect of inhibiting the memmove() optimization.) Note that a further improvement is expected in VS 2017 Update 1, where move assignment will attempt to reuse the buffer in the non-POCMA non-equal case.

Curious, I went back to check this. Building the example in Visual Studio 2017, the result is still a multi-page assembly listing with many function calls, so even if code generation has improved, it is hard to notice here.

However, when building with Clang 5.0.0 and the Visual Studio 2017 headers, we get the following assembly:

    main:                                   # @main
    .Lcfi0:
        .seh_proc main
    # BB#0:
        subq    $40, %rsp
    .Lcfi1:
        .seh_stackalloc 40
    .Lcfi2:
        .seh_endprologue
        movl    $16, %ecx
        callq   "??2@YAPEAX_K@Z"            ; void * __ptr64 __cdecl operator new(unsigned __int64)
        movq    %rax, %rcx
        callq   "??3@YAXPEAX@Z"             ; void __cdecl operator delete(void * __ptr64)
        movl    $10, %eax
        addq    $40, %rsp
        retq
        .seh_handlerdata
        .text

Note the movl $10, %eax instruction: with the 2017 <vector> , Clang manages to collapse everything, precompute the result 10, and keep only the calls to new and delete.

I'd say that's pretty awesome!

Function inlining

Function inlining is probably the single most important optimization in this example. By folding the code of called functions into their call sites, the compiler can perform further optimizations on the combined code, and removing the function calls is beneficial in itself, reducing call overhead and removing optimization barriers.

Inspecting the generated assembly for VS and comparing the code before and after inlining ( Compiler Explorer ), we see that most of the vector functions were indeed inlined, except for the allocation and deallocation functions. In particular, there are calls to memmove that result from inlining some higher-level functions such as _Uninitialized_copy_al_unchecked .

memmove is a library function and therefore cannot be inlined. Clang, however, has a clever way around this: it replaces the call to memmove with __builtin_memmove . __builtin_memmove is an intrinsic/builtin function with the same functionality as memmove , but unlike a plain function call, the compiler generates code for it and inlines it into the calling function. Consequently, the code can be optimized further inside the calling function and eventually removed as dead code.
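As a small sketch (my own example, not from the original assembly), the advantage of the builtin is that the copy stays visible to the optimizer, so the surrounding code can be folded, which an opaque library call would prevent:

```cpp
// Sketch: __builtin_memmove is a Clang/GCC intrinsic with memmove's
// semantics, but the compiler emits the copy inline instead of calling
// the library, so later optimization passes can see through it.
int copy_and_sum() {
    int src[4] = {1, 2, 3, 4};
    int dst[4];
    __builtin_memmove(dst, src, sizeof src);   // inlined, not a call
    return dst[0] + dst[1] + dst[2] + dst[3];  // typically folded away
}
```

With optimization enabled, Clang can eliminate both the copy and the arrays and return the constant directly, exactly the chain of inlining and dead-code elimination described above.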

Summary

In conclusion, Clang clearly outperforms VS in this example, thanks both to higher-quality optimizations and to a more efficient STL vector implementation. And even when the same header files (the Visual Studio 2017 headers) are used for both Visual C++ and Clang, Clang still comes out ahead.

While writing this answer, I could not help wondering what we would do without Compiler Explorer. Thanks, Matt Godbolt, for this great tool!

