Measuring how slow atomic increments are compared to ordinary integer increments

Question

Measuring how slow atomic increments are compared to ordinary integer increments

A recent discussion made me wonder how expensive an atomic increment is compared to a regular whole increment.

I wrote code to try and compare this:

#include <iostream> #include <atomic> #include <chrono> static const int NUM_TEST_RUNS = 100000; static const int ARRAY_SIZE = 500; void runBenchmark(std::atomic<int>& atomic_count, int* count_array, int array_size, bool do_atomic_increment){ for(int i = 0; i < array_size; ++i){ ++count_array[i]; } if(do_atomic_increment){ ++atomic_count; } } int main(int argc, char* argv[]){ int num_test_runs = NUM_TEST_RUNS; int array_size = ARRAY_SIZE; if(argc == 3){ num_test_runs = atoi(argv[1]); array_size = atoi(argv[2]); } if(num_test_runs == 0 || array_size == 0){ std::cout << "Usage: atomic_operation_overhead <num_test_runs> <num_integers_in_array>" << std::endl; return 1; } // Instantiate atomic counter std::atomic<int> atomic_count; // Allocate the integer buffer that will be updated every time int* count_array = new int[array_size]; // Track the time elapsed in case of incrmeenting with mutex locking auto start = std::chrono::steady_clock::now(); for(int i = 0; i < num_test_runs; ++i){ runBenchmark(atomic_count, count_array, array_size, true); } auto end = std::chrono::steady_clock::now(); // Calculate time elapsed for incrementing without mutex locking auto diff_with_lock = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start); std::cout << "Elapsed time with atomic increment for " << num_test_runs << " test runs: " << diff_with_lock.count() << " ns" << std::endl; // Track the time elapsed in case of incrementing without a mutex locking start = std::chrono::steady_clock::now(); for(unsigned int i = 0; i < num_test_runs; ++i){ runBenchmark(atomic_count, count_array, array_size, false); } end = std::chrono::steady_clock::now(); // Calculate time elapsed for incrementing without mutex locking auto diff_without_lock = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start); std::cout << "Elapsed time without atomic increment for " << num_test_runs << " test runs: " << diff_without_lock.count() << " ns" << std::endl; auto difference_running_times = diff_with_lock - diff_without_lock; auto proportion = difference_running_times.count() / (double)diff_without_lock.count(); std::cout << "How much slower was locking: " << proportion * 100.0 << " %" << std::endl; // We loop over all entries in the array and print their sum // We do this mainly to prevent the compiler from optimizing out // the loop where we increment all the values in the array int array_sum = 0; for(int i = 0; i < array_size; ++i){ array_sum += count_array[i]; } std::cout << "Array sum (just to prevent loop getting optimized out): " << array_sum << std::endl; delete [] count_array; return 0; }

The problem I am facing is that this program produces results with different discrepancies in each run:

 balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 1000 500 Elapsed time with atomic increment for 1000 test runs: 99852 ns Elapsed time without atomic increment for 1000 test runs: 96396 ns How much slower was locking: 3.58521 % balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 1000 500 Elapsed time with atomic increment for 1000 test runs: 182769 ns Elapsed time without atomic increment for 1000 test runs: 138319 ns How much slower was locking: 32.1359 % balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 1000 500 Elapsed time with atomic increment for 1000 test runs: 98858 ns Elapsed time without atomic increment for 1000 test runs: 96404 ns How much slower was locking: 2.54554 % balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 1000 500 Elapsed time with atomic increment for 1000 test runs: 107848 ns Elapsed time without atomic increment for 1000 test runs: 105174 ns How much slower was locking: 2.54245 % balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 1000 500 Elapsed time with atomic increment for 1000 test runs: 113865 ns Elapsed time without atomic increment for 1000 test runs: 100559 ns How much slower was locking: 13.232 % balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 1000 500 Elapsed time with atomic increment for 1000 test runs: 98956 ns Elapsed time without atomic increment for 1000 test runs: 106639 ns How much slower was locking: -7.20468 %

This makes me think that there might be a mistake in the benchmarking code itself. Is there some kind of error that I am missing? Is using std :: chrono for benchmarking wrong? Or is it a time difference due to the overhead of signal processing in the OS related to atomic operations?

What can i do wrong?

Test bench:

 Intel® Core™ i7-4700MQ CPU @ 2.40GHz × 8 8GB RAM GNU/Linux:Ubuntu LTS 14.04 (64 bit) GCC version: 4.8.4 Compilation: g++ -std=c++11 -O3 atomic_operation_overhead.cpp -o atomic_operation_overhead

EDIT: Updated test run output after compilation with -O3 optimization.

EDIT: after running the tests to increase the number of iterations and adding the loop amount to prevent optimization from the loop, as suggested by Adam, I get more converging results:

 balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 Elapsed time with atomic increment for 99999999 test runs: 7111974931 ns Elapsed time without atomic increment for 99999999 test runs: 6938317779 ns How much slower was locking: 2.50287 % Array sum (just to prevent loop getting optimized out): 1215751192 balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 Elapsed time with atomic increment for 99999999 test runs: 7424952991 ns Elapsed time without atomic increment for 99999999 test runs: 7262721866 ns How much slower was locking: 2.23375 % Array sum (just to prevent loop getting optimized out): 1215751192 balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 Elapsed time with atomic increment for 99999999 test runs: 7172114343 ns Elapsed time without atomic increment for 99999999 test runs: 7030985219 ns How much slower was locking: 2.00725 % Array sum (just to prevent loop getting optimized out): 1215751192 balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 Elapsed time with atomic increment for 99999999 test runs: 7094552104 ns Elapsed time without atomic increment for 99999999 test runs: 6971060941 ns How much slower was locking: 1.77148 % Array sum (just to prevent loop getting optimized out): 1215751192 balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 Elapsed time with atomic increment for 99999999 test runs: 7099907902 ns Elapsed time without atomic increment for 99999999 test runs: 6970289856 ns How much slower was locking: 1.85958 % Array sum (just to prevent loop getting optimized out): 1215751192 balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 Elapsed time with atomic increment for 99999999 test runs: 7763604675 ns Elapsed time without atomic increment for 99999999 test runs: 7229145316 ns How much slower was locking: 7.39312 % Array sum (just to prevent loop getting optimized out): 1215751192 balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 Elapsed time with atomic increment for 99999999 test runs: 7164534212 ns Elapsed time without atomic increment for 99999999 test runs: 6994993609 ns How much slower was locking: 2.42374 % Array sum (just to prevent loop getting optimized out): 1215751192 balajeerc@Balajee :~/Projects/Misc$ ./atomic_operation_overhead 99999999 500 Elapsed time with atomic increment for 99999999 test runs: 7154697145 ns Elapsed time without atomic increment for 99999999 test runs: 6997030700 ns How much slower was locking: 2.25333 % Array sum (just to prevent loop getting optimized out): 1215751192

+5

atomic c ++ 11 stdatomic

balajeerc 21 sept '15 at 4:45

source share

1 answer

Adam · Accepted Answer · 2015-09-21T05:47:59+0000

Some thoughts:

Perform more iterations, at least enough to take a few seconds. Your runs take milliseconds, so interrupting the I / O may be enough to distort the results.
Print the amount at the end. The compiler can be smart enough to optimize your loops much more than you think, so your code can do less work than you think. If the compiler sees that the value is never read, it can completely delete your loops.
Iterate all in one loop, unlike a loop calling a function. Although the compiler is likely to include your function call, it is best not to introduce another potential noise source.
I'm sure you will do this further, but add a threaded test. Could do this for both; you will get the wrong amount in a non-atomic variable due to races, but at least you will see how much you pay for consistency.

Measuring how slow atomic increments are compared to ordinary integer increments

More articles: