An important qualifier here is the meaning of "least instructions." If that is to be interpreted as causing the CPU to take the fewest steps, and we further stipulate that no advanced techniques are to be employed, like SIMD, GPU or OMP programming (or other automatic parallel technologies) .... just plain C or C++, then consider:
Assuming something like:
int a[ 10 ];
which is filled with data at runtime, and will always contain 10 entries (indices 0 through 9).
std::accumulate does a nice job here, creating a tight loop in the assembler output, no mess ... just fast:
int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 );
Assuming, of course, that a const int representing the size of the array 'a' would be acceptable.
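Note, by the way, that &a[ 9 ] as the end iterator stops one element short: the range is half-open, so a[ 9 ] itself is never added. To cover all 10 entries the end iterator must point one past the last element. A minimal self-contained sketch with a named size constant (N and the comments are my additions, not the original code):

#include <numeric>

const int N = 10;
int a[ N ];

// ... a[] is filled with data at runtime ...

// a + N is one past the last element, so all N entries are summed
int r = std::accumulate( a, a + N, 0 );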
This compares curiously with:
for( int n=0; n < 10; ++n ) r += a[ n ];
The compiler very smartly emits 10 unrolled add instructions - it doesn't even bother with a loop.
Now, this means that in std::accumulate, though the loop is tight, there are at a minimum two add instructions for each element (one for the sum and one to increment the iterator). Add to that the comparison instruction and a conditional jump, and there are at least 4 instructions per element, or about 40 machine-language steps of varying cost in ticks.
On the other hand, the unrolled result of the for loop is just 10 machine steps, which the CPU can very likely schedule with great cache friendliness and no jumps.
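In C++ terms, that unrolled result is effectively the following (a hypothetical equivalent of the generated code, not literal compiler output):

// what the optimizer effectively produces for the 10-element case,
// assuming r starts at 0
r = a[ 0 ] + a[ 1 ] + a[ 2 ] + a[ 3 ] + a[ 4 ]
  + a[ 5 ] + a[ 6 ] + a[ 7 ] + a[ 8 ] + a[ 9 ];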
The for loop is definitely faster.
The compiler "knows" what you are trying to do, and does the job as well as you could hope to think it through yourself with the code you posted.
Further, if the size of the array becomes too outlandish for the loop to be unrolled, the compiler automatically performs the classic optimization that std::accumulate, for some reason, does not seem to get ... i.e., performing two additions per loop iteration (when it does construct a loop because of the number of elements).
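In source form, that classic optimization looks roughly like this (an illustrative sketch; the function and variable names are mine):

int sum_unrolled2( const int *arr, int size )
{
    int s0 = 0, s1 = 0;          // two independent accumulators
    int i = 0;
    for ( ; i + 1 < size; i += 2 )
    {
        s0 += arr[ i ];          // two additions per iteration,
        s1 += arr[ i + 1 ];      // halving the compare/jump overhead
    }
    if ( i < size )
        s0 += arr[ i ];          // odd-count tail element
    return s0 + s1;
}

The two independent accumulators also let the CPU overlap the additions; notice this is exactly the shape of the while-loop assembler VC produces below (two adds into ecx and esi, a tail check, then an lea to combine them).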
Using VC 2012, this source:
int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 );

int z = 0;
int *ap = a;
int *ae = &a[9];
while( ap <= ae ) { z += *ap; ++ap; }

int z2 = 0;
for (int n=0; n < 10; ++n ) z2 += a[ n ];
produces the following assembler fragments in a release build:
int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 );
00301270 33 D2                 xor         edx,edx
00301272 B8 D4 40 30 00        mov         eax,3040D4h
00301277 EB 07                 jmp         wmain+10h (0301280h)
00301279 8D A4 24 00 00 00 00  lea         esp,[esp]
00301280 03 10                 add         edx,dword ptr [eax]
00301282 83 C0 04              add         eax,4
00301285 3D F8 40 30 00        cmp         eax,3040F8h
0030128A 75 F4                 jne         wmain+10h (0301280h)

while( ap <= ae ) { z += *ap; ++ap; }
003012A0 03 08                 add         ecx,dword ptr [eax]
003012A2 03 70 04              add         esi,dword ptr [eax+4]
003012A5 83 C0 08              add         eax,8
003012A8 3D F4 40 30 00        cmp         eax,3040F4h
003012AD 7E F1                 jle         wmain+30h (03012A0h)
003012AF 3D F8 40 30 00        cmp         eax,3040F8h
003012B4 77 02                 ja          wmain+48h (03012B8h)
003012B6 8B 38                 mov         edi,dword ptr [eax]
003012B8 8D 04 0E              lea         eax,[esi+ecx]
003012BB 03 F8                 add         edi,eax

for (int n=0; n < 10; ++n ) z2 += a[ n ];
003012BD A1 D4 40 30 00        mov         eax,dword ptr ds:[003040D4h]
003012C2 03 05 F8 40 30 00     add         eax,dword ptr ds:[3040F8h]
003012C8 03 05 D8 40 30 00     add         eax,dword ptr ds:[3040D8h]
003012CE 03 05 DC 40 30 00     add         eax,dword ptr ds:[3040DCh]
003012D4 03 05 E0 40 30 00     add         eax,dword ptr ds:[3040E0h]
003012DA 03 05 E4 40 30 00     add         eax,dword ptr ds:[3040E4h]
003012E0 03 05 E8 40 30 00     add         eax,dword ptr ds:[3040E8h]
003012E6 03 05 EC 40 30 00     add         eax,dword ptr ds:[3040ECh]
003012EC 03 05 F0 40 30 00     add         eax,dword ptr ds:[3040F0h]
003012F2 03 05 F4 40 30 00     add         eax,dword ptr ds:[3040F4h]
Based on the comments, I decided to try this in Xcode 7, with completely different results. This is its unrolling of the for loop:
(Xcode assembler listing for the unrolled for loop, truncated here; it begins with the directive .loc 1 58 36)
It may not look as simple as the plain VC listing, but it may well run just as fast, because the setup (movq or movl) for each addition can execute in the CPU in parallel as the previous addition finishes, costing little to nothing compared to VC's simple, clean "looking" series of additions on memory source operands.
Below is Xcode's std::accumulate. It sees that an init step is required, but then it performs a clean series of additions, having unrolled the loop where VC did not.
(Xcode assembler listing for std::accumulate, truncated here; it begins with .file 37 "/Applications/Xcode7.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1" "numeric" and .loc 37 75 27 is_stmt 1)
The bottom line here is that the optimizations we rely upon from compilers differ so widely from one compiler to the next that we should rely on them, but verify.
LLVM is quite revealing, and it seems to understand std::accumulate better than VC does, but this short study can't reveal whether that is a difference in the implementation of the library or of the compiler. There could be significant differences in Xcode's implementation of std::accumulate that give the compiler more insight than VC's version of the library provides.
This applies generically to algorithms, even numeric ones. std::accumulate is a for loop. It was most likely expanded inline as a pointer-based loop over the array, so VC's choice to emit a loop for std::accumulate was echoed in its choice to emit a loop for the code using an int * to walk the array, while it unrolled the for loop that used an integer to index the entries of the array. In other words, VC really did no better on the straight loop once pointers were involved, which points to VC's optimizer, not the library.
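For reference, std::accumulate is typically little more than the following once inlined (a simplified sketch of a typical library implementation, not VC's or Xcode's actual source):

template< class InputIt, class T >
T accumulate_sketch( InputIt first, InputIt last, T init )
{
    // instantiated with int * iterators, this is exactly a
    // pointer-based for loop over the array
    for ( ; first != last; ++first )
        init = init + *first;
    return init;
}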
This echoes Stroustrup's own illustration of the idea of information available to the compiler, comparing C's qsort with C++'s sort. qsort takes a function pointer to perform the comparison, cutting the compiler off from an understanding of the comparison and forcing it to call the function through a pointer. The C++ sort function, on the other hand, takes a functor, which conveys more information about the comparison. That can still result in a function call, but the optimizer has the opportunity to understand the comparison well enough to inline it.
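A minimal side-by-side sketch of that comparison (illustrative code, not Stroustrup's exact example):

#include <algorithm>
#include <cstdlib>

// C: the comparison hides behind a function pointer, so qsort must
// make an opaque indirect call for every comparison
int cmp_int( const void *p, const void *q )
{
    int x = *static_cast< const int * >( p );
    int y = *static_cast< const int * >( q );
    return ( x > y ) - ( x < y );
}

void sort_both_ways( int *a, int n )
{
    std::qsort( a, n, sizeof( int ), cmp_int );

    // C++: the functor's type carries the comparison's definition,
    // so the optimizer has the chance to inline it into the sort
    std::sort( a, a + n, []( int x, int y ){ return x < y; } );
}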
In VC's case, for whatever reason (we would have to ask Microsoft), the compiler is confused when walking through an array via pointers. The information given to it differs from that of the loop using an integer to index the array. It understands the one, but not the pointers. LLVM, by contrast, understood both (and more). The difference in information doesn't matter to LLVM, but it does to VC. Since std::accumulate really is an inlined for loop, and that loop is processed with pointers, it suffered the same fate as VC's treatment of the explicit pointer-based loop. If a specialization could be made for whole arrays, iterating with indexes rather than pointers, VC would respond better; but it shouldn't have to be that way.
A poor optimizer can miss the point, and a poor library implementation can confuse the optimizer, which means that under the best circumstances std::accumulate can perform about as well as the for loop for a simple array of integers, producing an unrolled version of the loop that computes the sum, but not always. However, there is little to stand in the compiler's way of understanding a for loop written right there in the source; the library implementation can't mess that up, since everything is up to the compiler at that point. On that count, VC showed its weakness.
I tried every setting I could on VC to try to get it to unroll std::accumulate, but so far it never has (I haven't tried newer versions of VC).
It didn't take much for Xcode to unroll the loop; LLVM seems to have deeper engineering. It may have a better library implementation, too.
Incidentally, the C code example I posted at the top was used with VC, which did not recognize that the three different summations were related. Xcode's LLVM did, which meant the first time I tried it there, it simply took the answer from std::accumulate and otherwise did nothing. VC was rather weak on that point. In order to get Xcode to perform 3 separate tests, I randomized the array before each call ... otherwise Xcode realized what I was doing where VC did not.
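The workaround amounted to something like this before each timed call (a sketch of the idea; the original test harness isn't reproduced here):

#include <algorithm>
#include <random>

// re-randomize the array so the optimizer cannot prove the three
// summations yield the same value and simply reuse one result
void reshuffle( int ( &arr )[ 10 ] )
{
    static std::mt19937 rng{ std::random_device{}() };
    std::shuffle( std::begin( arr ), std::end( arr ), rng );
}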