Here's some more info to expand on @deniss's answer, which correctly diagnosed the problem.
By the way, this is due to the most popular C ++ Q & A of all time "Why is processing a sorted array faster than an unsorted array?" .
The main problem is that the compiler must observe the logical operator AND (& &), and not load from v [i + 1] if the first condition is not true. This is a consequence of the semantics of the logical operator And, as well as the semantics of the compressed memory model introduced with C ++ 11, the corresponding articles in the draft standard
5.14 The logical operator AND [expr.log.and]
Unlike & , & guarantees an evaluation from left to right: the second operand is not evaluated if the first operand is false .
ISO C ++ 14 Standard (project N3797) sub>
and for speculative readings
1.10 Multithreaded executions and data schedules [intro.multithread]
23 [Note. Transformations that introduce a speculative read of potentially shared memory may not preserve the semantics of a C ++ program, as defined in this standard, since they can potentially lead to data race. However, they are usually valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for calculating data. They would not be valid for a hypothetical machine that is not race-tolerant or provides hardware race detection. - end note]
ISO C ++ 14 standard (project N3797)
My guess is that optimizers play it safely and currently prefer not to release speculative loads into potentially shared memory, rather than a special case for each target processor, whether a speculative load can represent a detectable data race for this purpose.
To implement this, the compiler generates a conditional branch. This is usually not noticeable, because modern processors have very complex industry prediction, and the error prediction rate is usually very low. However, the data here is random - this kills the branch prediction. The cost of erroneous prediction is from 10 to 20 CPU cycles, given that the CPU, as a rule, postpones 2 instructions per cycle, this is equivalent to 20-40 instructions. If the forecast rate is 50% (random), then each iteration has an incorrect prediction, equivalent to 10-20 instructions - HUGE .
Note. . The compiler can prove that the elements v[0] will refer to v[v.size()-2] in this order, regardless of the values ββthat they contain. This would allow the compiler in this case to generate code that unconditionally loads everything except the last element of the vector. The last element of the vector in v [v.size () - 1] can only be loaded into the last iteration of the loop and only if the first condition is true. Therefore, the compiler can generate code for the loop without branching the short circuit to the last iteration, and then use another code with the short circuit branch for the last iteration - this requires the compiler to know that the data is random, and branch prediction is useless and therefore worth working with these - compilers are not that complicated.
To avoid the conditional branch generated by the logical AND (& &), and to avoid loading memory cells into local variables, we can change the logical AND operator to Bitwise AND, a piece of code here , the result is almost 4 times faster when the data is random
int f2() { int n = 0; for (int i = 1; i < v.size()-1; ++i) n += (v[i-1] < v[i]) & (v[i] < v[i+1]);
Exit
3.642443ms min 3.779982ms median Result: 166634 3.725968ms min 3.870808ms median Result: 166634 1.052786ms min 1.081085ms median Result: 166634 done
The result on gcc 5.3 is 8 times faster ( live in Coliru here )
g++ --version g++ -std=c++14 -O3 -Wall -Wextra -pedantic -pthread -pedantic-errors main.cpp -lm && ./a.out g++ (GCC) 5.3.0 Copyright (C) 2015 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 3.761290ms min 4.025739ms median Result: 166634 3.823133ms min 4.050742ms median Result: 166634 0.459393ms min 0.505011ms median Result: 166634 done
You may wonder how the compiler can evaluate the comparison v[i-1] < v[i] without generating a conditional branch. The answer depends on the goal, for x86 this is possible because of the SETcc instruction, which generates one byte result, 0 or 1, depending on the state in the EFLAGS register, the same condition that could be used in a conditional branch, but without branching . In the generated code provided by @deniss, you can see the generated setl that sets the result to 1 if the less condition is met, which is evaluated by the previous comparison command:
cmp edx, edi ; a < b ? setl r10b ; r10b = a < b ? 1 : 0 mov ecx, dword ptr [r8 + 4*rsi + 4] ; c = v[i+1] lea rsi, [rsi + 1] ; ++i cmp edi, ecx ; b < c ? setl dl ; dl = b < c ? 1 : 0 and dl, r10b ; dl &= r10b movzx edx, dl ; edx = zero extended dl add eax, edx ; n += edx