Why do C++ optimizers have problems with these temporary variables, or rather, why should `v[]` be avoided in tight loops?

In this code snippet, I compare the performance of two functionally identical loops:

for (int i = 1; i < v.size()-1; ++i) {
    int a = v[i-1];
    int b = v[i];
    int c = v[i+1];
    if (a < b && b < c) ++n;
}

and

for (int i = 1; i < v.size()-1; ++i)
    if (v[i-1] < v[i] && v[i] < v[i+1]) ++n;

The second one runs significantly slower than the first one on many different C++ compilers with the optimization flag set to -O2:

  • the second loop is about 330% slower with clang 3.7.0
  • the second loop is about 2% slower with gcc 4.9.3
  • the second loop is about 2% slower with Visual C++ 2015

I am puzzled that modern C++ optimizers have problems handling this case. Any clues why? Do I have to write ugly code with temporary variables whenever I need maximum performance?

Using temporary variables makes the code faster, sometimes dramatically. What is going on?

The full code I'm using is below:

#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std;
using namespace std::chrono;

vector<int> v(1'000'000);

int f0()
{
    int n = 0;
    for (int i = 1; i < v.size()-1; ++i) {
        int a = v[i-1];
        int b = v[i];
        int c = v[i+1];
        if (a < b && b < c) ++n;
    }
    return n;
}

int f1()
{
    int n = 0;
    for (int i = 1; i < v.size()-1; ++i)
        if (v[i-1] < v[i] && v[i] < v[i+1]) ++n;
    return n;
}

int main()
{
    auto benchmark = [](int (*f)()) {
        const int N = 100;
        volatile long long result = 0;
        vector<long long> timings(N);
        for (int i = 0; i < N; ++i) {
            auto t0 = high_resolution_clock::now();
            result += f();
            auto t1 = high_resolution_clock::now();
            timings[i] = duration_cast<nanoseconds>(t1-t0).count();
        }
        sort(timings.begin(), timings.end());
        cout << fixed << setprecision(6)
             << timings.front()/1'000'000.0 << "ms min\n";
        cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n"
             << "Result: " << result/N << "\n\n";
    };

    mt19937 generator(31415);   // deterministic seed
    uniform_int_distribution<> distribution(0, 1023);
    for (auto& e: v) e = distribution(generator);

    benchmark(f0);
    benchmark(f1);

    cout << "\ndone\n";
    return 0;
}
+61
c++ performance optimization
Apr 05 '16 at 0:07
4 answers

It seems that the compiler lacks knowledge of the relationship between std::vector<>::size() and the size of the internal vector buffer. Consider std::vector being replaced by our custom vector-like object bugged_vector with a slight bug: its ::size() can sometimes be one larger than the internal buffer size n, but only in the case that v[n-2] >= v[n-1].

Then the two snippets have different semantics again: the first one has undefined behavior, since we access the element v[v.size() - 1]. The second one, however, does not: thanks to the short-circuiting of &&, we never read v[v.size() - 1] on the last iteration.

So, if the compiler cannot prove that our v is not a bugged_vector, it must short-circuit, which introduces an additional jump in the machine code.
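To make the thought experiment concrete, here is a minimal sketch of such a hypothetical bugged_vector (my illustration, not code from the answer):

#include <cstddef>
#include <vector>

// Its size() reports one element more than the buffer holds, but only when
// buf[n-2] >= buf[n-1]. For such a type, the first snippet's unconditional read
// of v[i+1] would be out of bounds on the last iteration, in exactly the case
// where the second snippet's short circuit would have skipped that read.
struct bugged_vector {
    std::vector<int> buf;                                  // n real elements
    std::size_t size() const {
        std::size_t n = buf.size();
        if (n >= 2 && buf[n - 2] >= buf[n - 1])
            return n + 1;                                  // lies by one in exactly this case
        return n;
    }
    const int& operator[](std::size_t i) const { return buf[i]; }
};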

Looking at the assembly output from clang, we can see that this is indeed what happens.

From the Godbolt Compiler Explorer, with clang 3.7.0 -O2, the loop in f0:

### f0: just the loop
.LBB1_2:                                # =>This Inner Loop Header: Depth=1
    mov     edi, ecx
    cmp     edx, edi
    setl    r10b
    mov     ecx, dword ptr [r8 + 4*rsi + 4]
    lea     rsi, [rsi + 1]
    cmp     edi, ecx
    setl    dl
    and     dl, r10b
    movzx   edx, dl
    add     eax, edx
    cmp     rsi, r9
    mov     edx, edi
    jb      .LBB1_2

And for f1 :

### f1: just the loop
.LBB2_2:                                # =>This Inner Loop Header: Depth=1
    mov     esi, r10d
    mov     r10d, dword ptr [r9 + 4*rdi]
    lea     rcx, [rdi + 1]
    cmp     esi, r10d
    jge     .LBB2_4                     # <== This is the extra jump
    cmp     r10d, dword ptr [r9 + 4*rdi + 4]
    setl    dl
    movzx   edx, dl
    add     eax, edx
.LBB2_4:                                # %._crit_edge.3
    cmp     rcx, r8
    mov     rdi, rcx
    jb      .LBB2_2

I have pointed out the extra jump in f1. And, as we (hopefully) know, conditional jumps in tight loops are bad for performance. (See the performance guides in the x86 tag wiki for details.)

GCC and Visual Studio are aware that std::vector is well-behaved, and generate nearly identical assembly for both snippets. Edit: It turns out that clang is actually the one doing the better job of optimizing the code. All three compilers are unable to prove that it is safe to read v[i + 1] before the comparison in the second example (or choose not to), but only clang manages to optimize the first example using the additional information that reading v[i + 1] is either valid or UB.

The 2% performance difference is negligible; it can be explained by different ordering or selection of some instructions.

+50
Apr 05 '16 at 0:57

Here's some more info to expand on @deniss's answer, which correctly diagnosed the problem.

Incidentally, this is related to the most popular C++ Q&A of all time: "Why is processing a sorted array faster than processing an unsorted array?".

The main issue is that the compiler must honor the logical AND operator (&&) and must not load from v[i + 1] if the first condition is not true. This is a consequence of the semantics of the logical AND operator, as well as of the tightened memory model semantics introduced with C++11; the relevant clauses in the draft standard are

5.14 Logical AND operator [expr.log.and]

Unlike &, && guarantees left-to-right evaluation: the second operand is not evaluated if the first operand is false.
ISO C++14 Standard (working draft N3797)

and for speculative reads

1.10 Multi-threaded executions and data races [intro.multithread]

23 [ Note: Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the C++ program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. - end note ]
ISO C++14 Standard (working draft N3797)

My guess is that optimizers play it safe and currently choose not to emit speculative loads to potentially shared memory at all, rather than special-casing, for each target processor, whether a speculative load could introduce a detectable data race on that target.
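As an illustration of why a compiler-introduced speculative read can be observable (my own example, not part of the original answer): if the right-hand side of && guards memory that is only safe to read when the left-hand side is true, hoisting the load would introduce a data race or an invalid access.

#include <atomic>

std::atomic<bool> ready{false};   // set to true by a writer thread after it publishes 'data'
int* data = nullptr;

int read_if_positive()
{
    // && must not read *data before checking 'ready': a speculative load of
    // *data here could race with the writer thread or dereference a null pointer.
    if (ready.load(std::memory_order_acquire) && *data > 0)
        return *data;
    return 0;
}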

To implement this, the compiler generates a conditional branch. Usually this is not noticeable, because modern processors have very sophisticated branch prediction and the misprediction rate is typically very low. However, the data here is random, and that kills branch prediction. The cost of a misprediction is 10 to 20 CPU cycles; considering that the CPU typically retires 2 instructions per cycle, this is equivalent to 20 to 40 instructions. If the prediction rate is 50% (random), then every iteration pays a misprediction penalty worth 10 to 20 instructions - HUGE.
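One way to sanity-check the branch-misprediction explanation (my suggestion, reusing the question's global v, benchmark lambda, f0 and f1; not part of the original answer) is to rerun the two loops on predictable data:

// Add at the end of main() in the question's program. Sorted data makes the
// short-circuit branch outcome form long, predictable runs, so if misprediction
// is the cause, f1 should get much closer to f0 on this second run.
sort(v.begin(), v.end());
benchmark(f0);
benchmark(f1);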

Note: The compiler can prove that the elements v[0] through v[v.size()-2] will be referenced, in that order, regardless of the values they contain. This would allow the compiler in this case to generate code that unconditionally loads all but the last element of the vector. The last element of the vector, at v[v.size()-1], may only be loaded on the last iteration of the loop, and only if the first condition is true. The compiler could therefore generate code for the loop without the short-circuit branch up until the last iteration, and then use different code with the short-circuit branch for the last iteration - but that would require the compiler to know that the data is random and that branch prediction is useless, and therefore that it is worth bothering with all this - compilers are not that sophisticated.
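To make the note concrete, here is a rough source-level sketch of that transformation (my own illustration, not something any of these compilers emits), reusing the question's global v and assuming v.size() >= 3 as in the benchmark:

int f0_peeled()
{
    int n = 0;
    if (v.size() < 3) return 0;                        // guard keeps the indexing below valid
    size_t last = v.size() - 2;                        // final index, handled separately after the loop
    for (size_t i = 1; i < last; ++i)
        n += (v[i-1] < v[i]) & (v[i] < v[i+1]);        // branchless: v[i+1] is always safe to read here
    if (v[last-1] < v[last] && v[last] < v[last+1])    // short circuit kept only for the last element
        ++n;
    return n;
}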

To avoid the conditional branch generated by the logical AND (&&), and to avoid loading the memory locations into local variables, we can change the logical AND operator to a bitwise AND (code snippet here); the result is almost 4x faster when the data is random:

int f2()
{
    int n = 0;
    for (int i = 1; i < v.size()-1; ++i)
        n += (v[i-1] < v[i]) & (v[i] < v[i+1]);   // Bitwise AND
    return n;
}

Output:

3.642443ms min
3.779982ms median
Result: 166634

3.725968ms min
3.870808ms median
Result: 166634

1.052786ms min
1.081085ms median
Result: 166634


done

On gcc 5.3 the result is 8x faster (live on Coliru here):

g++ --version
g++ -std=c++14 -O3 -Wall -Wextra -pedantic -pthread -pedantic-errors main.cpp -lm && ./a.out
g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

3.761290ms min
4.025739ms median
Result: 166634

3.823133ms min
4.050742ms median
Result: 166634

0.459393ms min
0.505011ms median
Result: 166634


done

You may wonder how the compiler can evaluate the comparison v[i-1] < v[i] without generating a conditional branch. The answer depends on the target; for x86 it is possible thanks to the SETcc instruction, which produces a one-byte result of 0 or 1 depending on a condition in the EFLAGS register, the same kind of condition that could be used in a conditional branch, but without the branch. In the generated code shown by @deniss you can see setl being generated, which sets the result to 1 if the "less than" condition, established by the preceding cmp instruction, holds:

cmp     edx, edi                             ; a < b ?
setl    r10b                                 ; r10b = a < b ? 1 : 0
mov     ecx, dword ptr [r8 + 4*rsi + 4]      ; c = v[i+1]
lea     rsi, [rsi + 1]                       ; ++i
cmp     edi, ecx                             ; b < c ?
setl    dl                                   ; dl = b < c ? 1 : 0
and     dl, r10b                             ; dl &= r10b
movzx   edx, dl                              ; edx = zero extended dl
add     eax, edx                             ; n += edx
+41
Apr 05 '16 at 3:08

f0 and f1 are semantically distinct.

x() && y() implies a short circuit when x() is false, as we know. This means that if x() is false, then y() must not be evaluated.
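A tiny self-contained illustration (mine, not from the answer) that the right-hand operand really is skipped, so the compiler cannot evaluate it early whenever doing so would be observable:

#include <iostream>

int calls = 0;
bool y() { ++calls; return true; }

int main()
{
    bool x = false;
    if (x && y()) { /* never taken */ }
    std::cout << calls << "\n";   // prints 0: y() was never evaluated
    return 0;
}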

This prevents prefetching of the data needed to evaluate y(), and (at least on clang) causes a conditional branch to be inserted, which results in branch-prediction misses.

Adding two more tests confirms the point.

#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std;
using namespace std::chrono;

vector<int> v(1'000'000);

int f0()
{
    int n = 0;
    for (int i = 1; i < v.size()-1; ++i) {
        int a = v[i-1];
        int b = v[i];
        int c = v[i+1];
        if (a < b && b < c) ++n;
    }
    return n;
}

int f1()
{
    int n = 0;
    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i)
        if (v[i-1] < v[i] && v[i] < v[i+1]) ++n;
    return n;
}

int f2()
{
    int n = 0;
    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i) {
        auto t1 = v[i-1] < v[i];
        auto t2 = v[i] < v[i+1];
        if (t1 && t2) ++n;
    }
    return n;
}

int f3()
{
    int n = 0;
    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i) {
        n += 1 * (v[i-1] < v[i]) * (v[i] < v[i+1]);
    }
    return n;
}

int main()
{
    auto benchmark = [](int (*f)()) {
        const int N = 100;
        volatile long long result = 0;
        vector<long long> timings(N);
        for (int i = 0; i < N; ++i) {
            auto t0 = high_resolution_clock::now();
            result += f();
            auto t1 = high_resolution_clock::now();
            timings[i] = duration_cast<nanoseconds>(t1-t0).count();
        }
        sort(timings.begin(), timings.end());
        cout << fixed << setprecision(6)
             << timings.front()/1'000'000.0 << "ms min\n";
        cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n"
             << "Result: " << result/N << "\n\n";
    };

    mt19937 generator(31415);   // deterministic seed
    uniform_int_distribution<> distribution(0, 1023);
    for (auto& e: v) e = distribution(generator);

    benchmark(f0);
    benchmark(f1);
    benchmark(f2);
    benchmark(f3);

    cout << "\ndone\n";
    return 0;
}

results (apple clang, -O2):

1.233948ms min
1.320545ms median
Result: 166850

3.366751ms min
3.493069ms median
Result: 166850

1.261948ms min
1.361748ms median
Result: 166850

1.251434ms min
1.353653ms median
Result: 166850
+7
Apr 05 '16 at 9:04

None of the answers so far have mentioned a version of f() that gcc or clang can fully optimize. They all generate asm that does both compares every iteration. See the code with asm output on the Godbolt Compiler Explorer. (Essential background for predicting performance from asm output: Agner Fog's microarchitecture guide and the other links on the x86 tag wiki. As always, it usually works best to profile with performance counters to find stalls.)

v[i-1] < v[i] is work we already did on the previous iteration, when we evaluated v[i] < v[i+1]. In theory, helping the compiler grok that would let it optimize better (see f3()). In practice, that ends up defeating auto-vectorization in some cases, and gcc emits code with partial-register stalls, even with -mtune=core2 where that is a huge problem.

Manually hoisting the v.size() - 1 out of the loop's upper-bound check seems to help. The OP's f0 and f1 don't actually re-compute v.size() from the start/end pointers of v, but somehow they still optimize less well than when a size_t upper = v.size() - 1 is computed outside the loop (f2() and f4()).

A separate issue is that using an int loop counter with a size_t upper bound means the loop is potentially infinite. I'm not sure how much impact this has on the other optimizations.




Bottom line: compilers are complex beasts. Predicting which version will optimize well is not at all obvious or straightforward.




Results on 64-bit Ubuntu 15.10, on Core2 E6600 (Merom / Conroe microarchitecture).

        clang++-3.8 -O3 -march=core2   | g++ 5.2 -O3 -march=core2         | gcc 5.2 -O2 (default -mtune=generic)
f0      1.825ms min(1.858 med)         | 5.008ms min(5.048 med)           | 5.000 min(5.028 med)
f1      4.637ms min(4.673 med)         | 4.899ms min(4.952 med)           | 4.894 min(4.931 med)
f2      1.292ms min(1.323 med)         | 1.058ms min(1.088 med) (autovec) | 4.888 min(4.912 med)
f3      1.082ms min(1.117 med)         | 2.426ms min(2.458 med)           | 2.420 min(2.465 med)
f4      1.291ms min(1.341 med)         | 1.022ms min(1.052 med) (autovec) | 2.529 min(2.560 med)

The results would be different on Intel SnB-family hardware, especially IvyBridge and later, where there would be no partial-register slowdowns at all. Core2 is limited by slow unaligned loads and only one load per cycle. The loops may be small enough that decode is not an issue, though.




f0 and f1 :

gcc 5.2: The OP's f0 and f1 both produce branchy loops and won't auto-vectorize. f0 uses only one branch, however, and uses a weird setl sil / cmp sil, 1 / sbb eax, -1 to do the second half of the short-circuit comparison. So it still performs both comparisons on every iteration.

clang 3.8: f0: only one load per iteration, but it does both compares and ands them together. f1: both compares every iteration, one of them with a branch, to preserve the C semantics. Two loads per iteration.




int f2()
{
    int n = 0;
    size_t upper = v.size()-1;   // difference from f0: hoist the upper bound and use a size_t loop counter
    for (size_t i = 1; i < upper; ++i) {
        int a = v[i-1], b = v[i], c = v[i+1];
        if (a < b && b < c) ++n;
    }
    return n;
}

gcc 5.2 -O3: auto-vectorizes, with three loads to get the three offset vectors needed to produce one vector of 4 comparison results. Also, after combining the results of the two pcmpgtd instructions, it compares them against a vector of all-zeros and then masks that. Zero is already the identity element for addition, so that is really silly.

clang 3.8 -O3: unrolls: every iteration does two loads, three cmp/setcc, two ands, and two adds.
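For comparison, here is a hand-written sketch of the straightforward vectorization strategy described above for f2 - three unaligned loads per 4 results, two compares, one AND - using SSE2 intrinsics (my illustration, reusing the question's global v; it is not gcc's actual output):

#include <emmintrin.h>

int f2_sse2_sketch()
{
    int n = 0;
    size_t upper = v.size() - 1;
    size_t i = 1;
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= upper; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i*)&v[i-1]);   // v[i-1 .. i+2]
        __m128i b = _mm_loadu_si128((const __m128i*)&v[i]);     // v[i   .. i+3]
        __m128i c = _mm_loadu_si128((const __m128i*)&v[i+1]);   // v[i+1 .. i+4]
        __m128i lt1 = _mm_cmpgt_epi32(b, a);                    // per lane: (v[i-1] < v[i]) ? -1 : 0
        __m128i lt2 = _mm_cmpgt_epi32(c, b);                    // per lane: (v[i] < v[i+1]) ? -1 : 0
        acc = _mm_sub_epi32(acc, _mm_and_si128(lt1, lt2));      // subtracting -1 adds 1 per hit
    }
    int lanes[4];
    _mm_storeu_si128((__m128i*)lanes, acc);
    n += lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < upper; ++i)                                      // scalar tail
        n += (v[i-1] < v[i]) & (v[i] < v[i+1]);
    return n;
}

Note that zero is already the identity element for addition, so no extra compare-against-zero and masking step is needed; that is the point made above about gcc's output being silly.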




int f4()
{
    int n = 0;
    size_t upper = v.size()-1;
    for (size_t i = 1; i < upper; ++i) {
        int a = v[i-1], b = v[i], c = v[i+1];
        bool ab_lt = a < b;
        bool bc_lt = b < c;
        n += (ab_lt & bc_lt);
        // some really minor code-gen differences from f2: auto-vectorizes to
        // better code that runs slightly faster even for this large problem size
    }
    return n;
}
  • gcc 5.2 -O3: auto-vectorizes, like f2, but without the extra pcmpeqd.
  • gcc 5.2 -O2: didn't investigate why this is twice as fast as f2.
  • clang -O3: about the same code as for f2.



An attempt to hand-hold the compiler

int f3()
{
    int n = 0;
    int a = v[0], b = v[1];   // These happen before checking v.size, defeating the loop vectorizer or something
    bool ab_lt = a < b;

    size_t upper = v.size()-1;
    for (size_t i = 1; i < upper; ++i) {
        int c = v[i+1];       // only one load and compare inside the loop
        bool bc_lt = b < c;

        n += (ab_lt & bc_lt);

        ab_lt = bc_lt;
        a = b;                // unused inside the loop, only the compare result is needed
        b = c;
    }
    return n;
}
  • clang 3.8 -O3: unrolls with four loads inside the loop (clang typically likes to unroll by 4 when there are no complex loop-carried dependencies).
    4x cmp/setcc, 4x and/movzx, 4x add. So clang did exactly what I was hoping for and produced near-optimal scalar code. This was the fastest non-vectorized version, and (on Core2, where movups unaligned loads are slow) it runs as fast as gcc's vectorized versions.

  • gcc 5.2 -O3: fails to auto-vectorize. My theory is that accessing the array outside the loop confuses the auto-vectorizer. Maybe it's because we do it before checking v.size(), or maybe it's just in general.

    Compiles to the scalar code we were hoping for, with one load, one cmp/setcc, and one and per iteration. But gcc creates a partial-register stall, even with -mtune=core2 where that is a huge problem (a 2 to 3 cycle stall to insert a merging uop when reading a wide register after writing only part of it). (setcc is only available with an 8-bit operand size, which IMO is something AMD should have changed when they designed the AMD64 ISA.) This is the main reason why gcc's code runs 2.5x slower than clang's.


## the loop in f3(), from gcc 5.2 -O3 (same code with -O2)
.L31:
    add     rcx, 1                        # i,
    mov     edi, DWORD PTR [r10+rcx*4]    # a, MEM[base: _19, index: i_13, step: 4, offset: 0]
    cmp     edi, r8d                      # a, a
        # gcc's verbose-asm comments are a bit bogus here: one of these `a`s is from
        # the last iteration, so this is really comparing c, b
    mov     r8d, edi                      # a, a
    setg    sil                           #, tmp124
    and     edx, esi                      # D.111089, tmp124
        # PARTIAL-REG STALL: reading esi after writing sil
    movzx   edx, dl
        # using movzx to widen sil to esi would have solved the problem, instead of doing it after the and
    add     eax, edx                      # n, D.111085    # n += ...
    cmp     r9, rcx                       # upper, i
    mov     edx, esi                      # ab_lt, tmp124
    jne     .L31                          #,
    ret



+2
Apr 11 '16 at 21:49


