Most efficient way to sum all the elements of an array of compile-time size

I am trying to sum all the elements of an array whose size is known at compile time, using the minimum number of instructions. Naturally, I use templates. Here is what I created:

    template<unsigned int startIndex, unsigned int count>
    int AddCollapseArray(int theArray[])
    {
        if (count == 1)
        {
            return theArray[startIndex];
        }
        else if (count == 2)
        {
            return theArray[startIndex] + theArray[startIndex + 1];
        }
        else if (count % 2 == 0)
        {
            return AddCollapseArray<startIndex, count / 2>(theArray)
                 + AddCollapseArray<startIndex + count / 2, count / 2>(theArray);
        }
        else // count % 2 == 1
        {
            constexpr unsigned int newCount = count - 1;
            return AddCollapseArray<startIndex, newCount / 2>(theArray)
                 + AddCollapseArray<startIndex + newCount / 2, newCount / 2>(theArray)
                 + theArray[startIndex + newCount];
        }
    }

This seems like it will do the job most efficiently for me. I believe the branching and the arithmetic other than the additions will be fully optimized away. Are there any flaws in doing it this way?

3 answers

An important qualifier here is the meaning of "fewest instructions." If that is to be interpreted as making the CPU take the fewest steps, and we further stipulate that no advanced techniques are in play, such as SIMD, GPU or OMP programming (or other automatic parallel technologies)... just C or C++, then consider:

Assuming something like:

 int a[ 10 ]; 

which is filled with data at runtime and will always contain 10 entries (0 through 9).

std::accumulate does a nice job here, creating a tight loop in assembly, no mess... just fast:

 int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 ); 

Though, of course, some const int representing the size of the array 'a' would be preferable.

Compare this, out of curiosity, with:

 for( int n=0; n < 10; ++n ) r += a[ n ]; 

The compiler very smartly emits 10 unrolled add instructions; it doesn't even bother with a loop.

Now this means that with std::accumulate, even though the loop is tight, there are at minimum two add instructions per element (one for the sum and one to increment the iterator). Add to that the compare instruction and a conditional jump, and there are at least 4 instructions per element, or about 40 machine-language steps of varying cost in ticks.

On the other hand, the unrolled result of the for loop is just 10 machine steps, which the CPU can likely schedule with great cache friendliness, and no jumps.

The for loop is definitely faster.

The compiler "knows" what you are trying to do, and does the job as well as you could hope for with the code you posted.

Moreover, if the size of the array becomes too outlandish for the loop to be unrolled, the compiler automatically performs the classic optimization that std::accumulate, for some reason, does not seem to get... i.e., performing two additions per loop iteration (when it constructs a loop because of the number of elements).

Using VC 2012, this source:

    int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 );

    int z = 0;
    int *ap = a;
    int *ae = &a[9];
    while( ap <= ae ) { z += *ap; ++ap; }

    int z2 = 0;
    for (int n=0; n < 10; ++n ) z2 += a[ n ];

produces the following assembly fragments in a VC2012 release build:

    int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 );
    00301270  33 D2                 xor   edx,edx
    00301272  B8 D4 40 30 00        mov   eax,3040D4h
    00301277  EB 07                 jmp   wmain+10h (0301280h)
    00301279  8D A4 24 00 00 00 00  lea   esp,[esp]
    00301280  03 10                 add   edx,dword ptr [eax]
    00301282  83 C0 04              add   eax,4
    00301285  3D F8 40 30 00        cmp   eax,3040F8h
    0030128A  75 F4                 jne   wmain+10h (0301280h)

    while( ap <= ae ) { z += *ap; ++ap; }
    003012A0  03 08                 add   ecx,dword ptr [eax]
    003012A2  03 70 04              add   esi,dword ptr [eax+4]
    003012A5  83 C0 08              add   eax,8
    003012A8  3D F4 40 30 00        cmp   eax,3040F4h
    003012AD  7E F1                 jle   wmain+30h (03012A0h)
    003012AF  3D F8 40 30 00        cmp   eax,3040F8h
    003012B4  77 02                 ja    wmain+48h (03012B8h)
    003012B6  8B 38                 mov   edi,dword ptr [eax]
    003012B8  8D 04 0E              lea   eax,[esi+ecx]
    003012BB  03 F8                 add   edi,eax

    for (int n=0; n < 10; ++n ) z2 += a[ n ];
    003012BD  A1 D4 40 30 00        mov   eax,dword ptr ds:[003040D4h]
    003012C2  03 05 F8 40 30 00     add   eax,dword ptr ds:[3040F8h]
    003012C8  03 05 D8 40 30 00     add   eax,dword ptr ds:[3040D8h]
    003012CE  03 05 DC 40 30 00     add   eax,dword ptr ds:[3040DCh]
    003012D4  03 05 E0 40 30 00     add   eax,dword ptr ds:[3040E0h]
    003012DA  03 05 E4 40 30 00     add   eax,dword ptr ds:[3040E4h]
    003012E0  03 05 E8 40 30 00     add   eax,dword ptr ds:[3040E8h]
    003012E6  03 05 EC 40 30 00     add   eax,dword ptr ds:[3040ECh]
    003012EC  03 05 F0 40 30 00     add   eax,dword ptr ds:[3040F0h]
    003012F2  03 05 F4 40 30 00     add   eax,dword ptr ds:[3040F4h]

Based on the comments, I decided to try this in Xcode 7, with drastically different results. This is its unroll of the for loop:

    .loc    1 58 36    ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
    movq    _a(%rip), %rax
    Ltmp22:
    ##DEBUG_VALUE: do3:z2 <- EAX
    movq    %rax, %rcx
    shrq    $32, %rcx
    .loc    1 58 33 is_stmt 0    ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
    addl    %eax, %ecx
    .loc    1 58 36    ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
    movq    _a+8(%rip), %rax
    Ltmp23:
    .loc    1 58 33    ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
    movl    %eax, %edx
    addl    %ecx, %edx
    shrq    $32, %rax
    addl    %edx, %eax
    .loc    1 58 36    ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
    movq    _a+16(%rip), %rcx
    .loc    1 58 33    ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
    movl    %ecx, %edx
    addl    %eax, %edx
    shrq    $32, %rcx
    addl    %edx, %ecx
    .loc    1 58 36    ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
    movq    _a+24(%rip), %rax
    .loc    1 58 33    ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
    movl    %eax, %edx
    addl    %ecx, %edx
    shrq    $32, %rax
    addl    %edx, %eax
    .loc    1 58 36    ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
    movq    _a+32(%rip), %rcx
    .loc    1 58 33    ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
    movl    %ecx, %edx
    addl    %eax, %edx
    shrq    $32, %rcx
    addl    %edx, %ecx

It may not look as clean as the simple VC listing, but it may run just as fast, because the setup (movq or movl) for each addition can execute in parallel in the CPU while the previous entry is finishing its addition, costing little to nothing compared to VC's simple, clean-"looking" series of additions on memory sources.

The following is Xcode's std::accumulate. It appears there is an init required, but then it performs a clean series of additions, having unrolled the loop where VC did not.

    .file   37 "/Applications/Xcode7.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1" "numeric"
    .loc    37 75 27 is_stmt 1    ## /Applications/Xcode7.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/numeric:75:27
    movq    _a(%rip), %r14
    Ltmp11:
    movq    %r14, -48(%rbp)    ## 8-byte Spill
    Ltmp12:
    shrq    $32, %r14
    movq    _a+8(%rip), %rbx
    movq    %rbx, -56(%rbp)    ## 8-byte Spill
    shrq    $32, %rbx
    movq    _a+16(%rip), %r13
    movq    %r13, -72(%rbp)    ## 8-byte Spill
    shrq    $32, %r13
    movq    _a+24(%rip), %r15
    movq    %r15, %r12
    shrq    $32, %r12
    Ltmp13:
    movl    _a+32(%rip), %eax
    Ltmp14:
    movq    -48(%rbp), %rax    ## 8-byte Reload
    addl    %eax, %r14d
    movq    -56(%rbp), %rax    ## 8-byte Reload
    addl    %eax, %r14d
    addl    %ebx, %r14d
    movq    -72(%rbp), %rax    ## 8-byte Reload
    addl    %eax, %r14d
    addl    %r13d, %r14d
    addl    %r15d, %r14d
    addl    %r12d, %r14d
    addl    -64(%rbp), %r14d    ## 4-byte Folded Reload

The bottom line here is that the optimizations we rely upon from compilers differ so widely and wildly from one compiler to another that we can rely on them, but we must watch.

LLVM is quite exemplary, and appears to understand std::accumulate better than VC; but this brief investigation cannot reveal whether that is a difference in the implementation of the library or of the compiler. There could be significant differences in Xcode's implementation of std::accumulate that give the compiler more insight than VC's version of the library provides.

This applies more generally to algorithms, even mundane ones. std::accumulate is a for loop underneath. It is most likely expanded inline as a loop based on pointers into the array, which is why VC's choice to emit a loop for std::accumulate echoed its choice to emit a loop for the code using an int * to walk through the array, while it unrolled the loop for the for loop that uses an integer to index the entries of the array. In other words, VC really did no better in a straight for loop when pointers were used, which implicates VC's optimizer, not the library.

This follows Stroustrup's own illustrative example of the information available to the compiler, comparing qsort from C with sort from C++. qsort takes a pointer to a comparison function, cutting the compiler off from an understanding of the comparison and forcing it to call the function through a pointer. The C++ sort function, on the other hand, takes a functor, which conveys more information about the comparison. That may still result in a function call, but the optimizer has the opportunity to understand the comparison well enough to inline it.

In VC's case, for whatever reason (we would have to ask Microsoft), the compiler is confused when walking through the array via pointers. The information given to it differs from that of the loop using an integer to index the array. It understands the one, but not the other. LLVM, by contrast, understood both (and more). The difference in information does not matter to LLVM, but it does to VC. Since std::accumulate is really an inline function representing a for loop, and that loop is processed with pointers, it escaped VC's recognition, just as VC failed on the straight pointer-based loop. If a specialization could be made for int arrays, such that accumulate looped with indices rather than pointers, VC would respond better, but it should not have to be so.

A poor optimizer can miss the point, and a poor library implementation can confuse the optimizer, which means that under the best circumstances std::accumulate can perform about as well as the for loop for a simple array of integers, producing an unrolled version of the loop that computes the sum, but not always. However, there is little to get in the way of the compiler's understanding of a for loop... everything is right there, and the library implementation cannot mess it up; it is all up to the compiler at that point. For that, VC shows its weakness.

I tried all the settings on VC to try to get it to unroll std::accumulate, but so far it never did (I have not tried newer versions of VC).

It did not take much for Xcode to unroll the loop; LLVM seems to have deeper engineering. It may have a better library implementation, too.

Incidentally, the C code example I posted at the top was used in VC, which did not recognize that the three different summations were related. Xcode's LLVM did, which meant that the first time I tried it there, it simply adopted the answer from std::accumulate and otherwise did nothing. VC was really weak on that point. In order to get Xcode to perform 3 separate tests, I randomized the array before each call... otherwise Xcode realized what I was doing where VC did not.


Do not try to outsmart the optimizer. All this complicated template machinery just makes it harder for the optimizer to understand what you actually want to do.

For instance,

    int f0(int *p) { return AddCollapseArray<0, 10>(p); }
    int f1(int *p) { return std::accumulate(p+0, p+10, 0); }

generate the same assembly with clang at -O3:

    f0(int*):    # @f0(int*)
        movl    4(%rdi), %eax
        addl    (%rdi), %eax
        addl    8(%rdi), %eax
        addl    12(%rdi), %eax
        addl    16(%rdi), %eax
        addl    20(%rdi), %eax
        addl    24(%rdi), %eax
        addl    28(%rdi), %eax
        addl    32(%rdi), %eax
        addl    36(%rdi), %eax
        retq
    f1(int*):    # @f1(int*)
        movl    4(%rdi), %eax
        addl    (%rdi), %eax
        addl    8(%rdi), %eax
        addl    12(%rdi), %eax
        addl    16(%rdi), %eax
        addl    20(%rdi), %eax
        addl    24(%rdi), %eax
        addl    28(%rdi), %eax
        addl    32(%rdi), %eax
        addl    36(%rdi), %eax
        retq

Say we want to make 100 elements:

    int f0(int *p) { return AddCollapseArray<0, 100>(p); }
    int f1(int *p) { return std::accumulate(p+0, p+100, 0); }

Here we get:

    f0(int*):    # @f0(int*)
        pushq   %rbp
        pushq   %rbx
        pushq   %rax
        movq    %rdi, %rbx
        callq   int AddCollapseArray<0u, 50u>(int*)
        movl    %eax, %ebp
        movq    %rbx, %rdi
        callq   int AddCollapseArray<50u, 50u>(int*)
        addl    %ebp, %eax
        addq    $8, %rsp
        popq    %rbx
        popq    %rbp
        retq
    f1(int*):    # @f1(int*)
        movdqu  (%rdi), %xmm0
        movdqu  16(%rdi), %xmm1
        movdqu  32(%rdi), %xmm2
        movdqu  48(%rdi), %xmm3
        paddd   %xmm0, %xmm1
        paddd   %xmm2, %xmm1
        paddd   %xmm3, %xmm1
        movdqu  64(%rdi), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  80(%rdi), %xmm1
        paddd   %xmm0, %xmm1
        movdqu  96(%rdi), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  112(%rdi), %xmm1
        paddd   %xmm0, %xmm1
        movdqu  128(%rdi), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  144(%rdi), %xmm1
        paddd   %xmm0, %xmm1
        movdqu  160(%rdi), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  176(%rdi), %xmm1
        paddd   %xmm0, %xmm1
        movdqu  192(%rdi), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  208(%rdi), %xmm1
        paddd   %xmm0, %xmm1
        movdqu  224(%rdi), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  240(%rdi), %xmm1
        paddd   %xmm0, %xmm1
        movdqu  256(%rdi), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  272(%rdi), %xmm1
        paddd   %xmm0, %xmm1
        movdqu  288(%rdi), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  304(%rdi), %xmm1
        paddd   %xmm0, %xmm1
        movdqu  320(%rdi), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  336(%rdi), %xmm1
        paddd   %xmm0, %xmm1
        movdqu  352(%rdi), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  368(%rdi), %xmm1
        paddd   %xmm0, %xmm1
        movdqu  384(%rdi), %xmm0
        paddd   %xmm1, %xmm0
        pshufd  $78, %xmm0, %xmm1    # xmm1 = xmm0[2,3,0,1]
        paddd   %xmm0, %xmm1
        pshufd  $229, %xmm1, %xmm0   # xmm0 = xmm1[1,1,2,3]
        paddd   %xmm1, %xmm0
        movd    %xmm0, %eax
        retq
    int AddCollapseArray<0u, 50u>(int*):    # @int AddCollapseArray<0u, 50u>(int*)
        movl    4(%rdi), %eax
        addl    (%rdi), %eax
        addl    8(%rdi), %eax
        addl    12(%rdi), %eax
        addl    16(%rdi), %eax
        addl    20(%rdi), %eax
        addl    24(%rdi), %eax
        addl    28(%rdi), %eax
        addl    32(%rdi), %eax
        addl    36(%rdi), %eax
        addl    40(%rdi), %eax
        addl    44(%rdi), %eax
        addl    48(%rdi), %eax
        addl    52(%rdi), %eax
        addl    56(%rdi), %eax
        addl    60(%rdi), %eax
        addl    64(%rdi), %eax
        addl    68(%rdi), %eax
        addl    72(%rdi), %eax
        addl    76(%rdi), %eax
        addl    80(%rdi), %eax
        addl    84(%rdi), %eax
        addl    88(%rdi), %eax
        addl    92(%rdi), %eax
        addl    96(%rdi), %eax
        addl    100(%rdi), %eax
        addl    104(%rdi), %eax
        addl    108(%rdi), %eax
        addl    112(%rdi), %eax
        addl    116(%rdi), %eax
        addl    120(%rdi), %eax
        addl    124(%rdi), %eax
        addl    128(%rdi), %eax
        addl    132(%rdi), %eax
        addl    136(%rdi), %eax
        addl    140(%rdi), %eax
        addl    144(%rdi), %eax
        addl    148(%rdi), %eax
        addl    152(%rdi), %eax
        addl    156(%rdi), %eax
        addl    160(%rdi), %eax
        addl    164(%rdi), %eax
        addl    168(%rdi), %eax
        addl    172(%rdi), %eax
        addl    176(%rdi), %eax
        addl    180(%rdi), %eax
        addl    184(%rdi), %eax
        addl    188(%rdi), %eax
        addl    192(%rdi), %eax
        addl    196(%rdi), %eax
        retq
    int AddCollapseArray<50u, 50u>(int*):    # @int AddCollapseArray<50u, 50u>(int*)
        movl    204(%rdi), %eax
        addl    200(%rdi), %eax
        addl    208(%rdi), %eax
        addl    212(%rdi), %eax
        addl    216(%rdi), %eax
        addl    220(%rdi), %eax
        addl    224(%rdi), %eax
        addl    228(%rdi), %eax
        addl    232(%rdi), %eax
        addl    236(%rdi), %eax
        addl    240(%rdi), %eax
        addl    244(%rdi), %eax
        addl    248(%rdi), %eax
        addl    252(%rdi), %eax
        addl    256(%rdi), %eax
        addl    260(%rdi), %eax
        addl    264(%rdi), %eax
        addl    268(%rdi), %eax
        addl    272(%rdi), %eax
        addl    276(%rdi), %eax
        addl    280(%rdi), %eax
        addl    284(%rdi), %eax
        addl    288(%rdi), %eax
        addl    292(%rdi), %eax
        addl    296(%rdi), %eax
        addl    300(%rdi), %eax
        addl    304(%rdi), %eax
        addl    308(%rdi), %eax
        addl    312(%rdi), %eax
        addl    316(%rdi), %eax
        addl    320(%rdi), %eax
        addl    324(%rdi), %eax
        addl    328(%rdi), %eax
        addl    332(%rdi), %eax
        addl    336(%rdi), %eax
        addl    340(%rdi), %eax
        addl    344(%rdi), %eax
        addl    348(%rdi), %eax
        addl    352(%rdi), %eax
        addl    356(%rdi), %eax
        addl    360(%rdi), %eax
        addl    364(%rdi), %eax
        addl    368(%rdi), %eax
        addl    372(%rdi), %eax
        addl    376(%rdi), %eax
        addl    380(%rdi), %eax
        addl    384(%rdi), %eax
        addl    388(%rdi), %eax
        addl    392(%rdi), %eax
        addl    396(%rdi), %eax
        retq

Not only is your function not fully inlined, it is also not vectorized. GCC produces similar results.


While std::accumulate should be enough, if you want to unroll the loop manually you can do:

    namespace detail {

    template<std::size_t startIndex, std::size_t... Is>
    int Accumulate(std::index_sequence<Is...>, const int a[])
    {
        int res = 0;
        const int dummy[] = {0, ((res += a[startIndex + Is]), 0)...};
        static_cast<void>(dummy); // Remove warning for unused variable
        return res;
    }

    } // namespace detail

    template<std::size_t startIndex, std::size_t count>
    int AddCollapseArray(const int a[])
    {
        return detail::Accumulate<startIndex>(std::make_index_sequence<count>{}, a);
    }

or in C++17, with a fold expression:

    namespace detail {

    template<std::size_t startIndex, std::size_t... Is>
    int Accumulate(std::index_sequence<Is...>, const int a[])
    {
        return (a[startIndex + Is] + ...);
    }

    } // namespace detail
