Why is std :: none_of faster than a manual loop?

I compared the performance of std::none_of with three different manual implementations using i) a for loop, ii) a range based for loop, and iii) iterators. To my surprise, I found that although all three manual implementations take about the same time, std::none_of much faster. My question is: why is this so?

I used the Google test library and compiled it with -std=c++14 -O3 . When starting the test, I limited the proximity of the process to one processor. I get the following result with GCC 6.2:

 Benchmark Time CPU Iterations -------------------------------------------------------- benchmarkSTL 28813 ns 28780 ns 24283 benchmarkManual 46203 ns 46191 ns 15063 benchmarkRange 48368 ns 48243 ns 16245 benchmarkIterator 44732 ns 44710 ns 15698 

In Clang 3.9, std::none_of also faster than a manual for loop, although the difference in speed is less. Here is the test code (for brevity only, for guidance):

 #include <algorithm> #include <array> #include <benchmark/benchmark.h> #include <functional> #include <random> const size_t N = 100000; const unsigned value = 31415926; template<size_t N> std::array<unsigned, N> generateData() { std::mt19937 randomEngine(0); std::array<unsigned, N> data; std::generate(data.begin(), data.end(), randomEngine); return data; } void benchmarkSTL(benchmark::State & state) { auto data = generateData<N>(); while (state.KeepRunning()) { bool result = std::none_of( data.begin(), data.end(), std::bind(std::equal_to<unsigned>(), std::placeholders::_1, value)); assert(result); } } void benchmarkManual(benchmark::State & state) { auto data = generateData<N>(); while (state.KeepRunning()) { bool result = true; for (size_t i = 0; i < N; i++) { if (data[i] == value) { result = false; break; } } assert(result); } } BENCHMARK(benchmarkSTL); BENCHMARK(benchmarkManual); BENCHMARK_MAIN(); 

Note that generating data using a random number generator does not matter. I get the same result when I just set the i -th element to i and check if the value is N + 1 .

+6
source share
1 answer

After another investigation, I will try to answer my question. As suggested by Kerrek SB, I looked at the generated assembly code. The bottom line is that GCC 6.2 does a much better job of std::none_of loop implicit in std::none_of , compared to the other three versions.

GCC 6.2:

  • std::none_of expands 4 times → 30 μs
  • the for manual, the for range, and the iterator do not expand at all → ~ 45μs

As suggested by Corristo, the result is a compiler dependency — which makes sense. Clang 3.9 expands everything except the for range, albeit to varying degrees.

Clang 3.9

  • `std :: none_of 'expands 8 times → ~ 30μs
  • manual for unfolds 5 times → ~ 35μs
  • the for range does not expand at all → ~ 60μs
  • iterator rotates 8 times → ~ 28 μs

All code has been compiled with -std=c++14 -O3 .

+2
source

All Articles