Is it normal that gcc atomic are embedded so slowly?

Question

Is it normal that gcc atomic are embedded so slowly?

I have an application in which I need to increment some statistics counters in a multi-threaded method. The increment should be thread safe, so I decided to use the gcc atomic builtins __sync_add_and_fetch() function. To get an idea of their effects, I did some simple performance checks and noticed that these functions are much slower than a simple pre / lean increase.

Here is the test program I created:

 #include <iostream> #include <pthread.h> #include <time.h> using namespace std; uint64_t diffTimes(struct timespec &start, struct timespec &end) { if(start.tv_sec == end.tv_sec) { return end.tv_nsec - start.tv_nsec; } else if(start.tv_sec < end.tv_sec) { uint64_t nsecs = (end.tv_sec - start.tv_sec) * 1000000000; return nsecs + end.tv_nsec - start.tv_nsec; } else { // this is actually an error return 0; } } void outputResult(const char *msg, struct timespec &start, struct timespec &end, uint32_t numIterations, uint64_t val) { uint64_t diff = diffTimes(start, end); cout << msg << ": " << "\n\t iterations: " << numIterations << ", result: " << val << "\n\t times [start, end] = [" << start.tv_sec << ", " << start.tv_nsec << "]" << "\n\t [" << end.tv_sec << ", " << end.tv_nsec << "]" << "\n\t [total, avg] = [" << diff << ", " << (diff/numIterations) << "] nano seconds" << endl; } int main(int argc, char **argv) { struct timespec start, end; uint64_t val = 0; uint32_t numIterations = 1000000; // // NON ATOMIC pre increment // clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for(uint32_t i = 0; i < numIterations; ++i) { ++val; } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); outputResult("Non-Atomic pre-increment", start, end, numIterations, val); val = 0; // // NON ATOMIC post increment // clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for(uint32_t i = 0; i < numIterations; ++i) { val++; } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); outputResult("Non-Atomic post-increment", start, end, numIterations, val); val = 0; // // ATOMIC add and fetch // clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for(uint32_t i = 0; i < numIterations; ++i) { __sync_add_and_fetch(&val, 1); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); outputResult("Atomic add and fetch", start, end, numIterations, val); val = 0; // // ATOMIC fetch and add // clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for(uint32_t i = 0; i < numIterations; ++i) { __sync_fetch_and_add(&val, 1); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); outputResult("Atomic fetch and add", start, end, numIterations, val); val = 0; // // Mutex protected post-increment // pthread_mutex_t mutex; pthread_mutex_init(&mutex, NULL); clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for(uint32_t i = 0; i < numIterations; ++i) { pthread_mutex_lock(&mutex); val++; pthread_mutex_unlock(&mutex); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); outputResult("Mutex post-increment", start, end, numIterations, val); val = 0; // // RWlock protected post-increment // pthread_rwlock_t rwlock; pthread_rwlock_init(&rwlock, NULL); clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); for(uint32_t i = 0; i < numIterations; ++i) { pthread_rwlock_wrlock(&rwlock); val++; pthread_rwlock_unlock(&rwlock); } clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); outputResult("RWlock post-increment", start, end, numIterations, val); val = 0; return 0; }

And here are the results:

 # ./atomicVsNonAtomic Non-Atomic pre-increment: iterations: 1000000, result: 1000000 times [start, end] = [0, 1585375] [0, 1586185] [total, avg] = [810, 0] nano seconds Non-Atomic post-increment: iterations: 1000000, result: 1000000 times [start, end] = [0, 1667489] [0, 1667920] [total, avg] = [431, 0] nano seconds Atomic add and fetch: iterations: 1000000, result: 1000000 times [start, end] = [0, 1682037] [0, 16595016] [total, avg] = [14912979, 14] nano seconds Atomic fetch and add: iterations: 1000000, result: 1000000 times [start, end] = [0, 16617178] [0, 31499571] [total, avg] = [14882393, 14] nano seconds Mutex post-increment: iterations: 1000000, result: 1000000 times [start, end] = [0, 31526810] [0, 68515763] [total, avg] = [36988953, 36] nano seconds RWlock post-increment: iterations: 1000000, result: 1000000 times [start, end] = [0, 68547649] [0, 110877351] [total, avg] = [42329702, 42] nano seconds

Here is the gcc compilation:

 g++ -o atomicVsNonAtomic.o -c -march=i686 -O2 -I. atomicVsNonAtomic.cc g++ -o atomicVsNonAtomic atomicVsNonAtomic.o -lrt -lpthread

And related information and versions:

 # gcc --version gcc (GCC) 4.3.2 Copyright (C) 2008 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. # uname -a Linux gtcba2v1 2.6.32.12-0.7-default #1 SMP 2010-05-20 11:14:20 +0200 i686 i686 i386 GNU/Linux

And now for the urgent question :) Is it normal that atomic operations are so slower?

Difference for a million iterations:

simple post-increment: 431 nano seconds
atomic fetch and add operation: 14882393 nano seconds

Of course, I understand that an atomic operation should be more expensive, but that seems exaggerated. For completeness, I also checked the pthread and rwlock mutexes. At least atomic operations are faster than pthread operations, but Im still wondering if I did something wrong. I could not bind it without specifying the -march=i686 option, maybe this affected?

UPDATE:

I extracted the -O2 compiler optimization and was able to get more consistent results as follows:

 # ./atomicVsNonAtomic Non-Atomic pre-increment: iterations: 1000000, result: 1000000 times [start, end] = [0, 1647303] [0, 4171164] [total, avg] = [2523861, 2] nano seconds Non-Atomic post-increment: iterations: 1000000, result: 1000000 times [start, end] = [0, 4310230] [0, 7262704] [total, avg] = [2952474, 2] nano seconds Atomic add and fetch: iterations: 1000000, result: 1000000 times [start, end] = [0, 7285996] [0, 25919067] [total, avg] = [18633071, 18] nano seconds Atomic fetch and add: iterations: 1000000, result: 1000000 times [start, end] = [0, 25941677] [0, 44544234] [total, avg] = [18602557, 18] nano seconds Mutex post-increment: iterations: 1000000, result: 1000000 times [start, end] = [0, 44573933] [0, 82318615] [total, avg] = [37744682, 37] nano seconds RWlock post-increment: iterations: 1000000, result: 1000000 times [start, end] = [0, 82344866] [0, 125124498] [total, avg] = [42779632, 42] nano seconds

+7

c ++ performance gcc atomic built-in

Brady Jul 23 '12 at 8:27

source share

3 answers

I assume gcc is optimizing your non-atomic increment operation on something like

 val += numIterations;

You say that 10 ^ 6 increments take 431 nanoseconds, which is 0.000431 ns per iteration of the loop. On a 4 GHz processor, the clock cycle is 0.25 ns, so it’s pretty obvious that the cycle is optimized. This explains the big difference in performance that you see.

Edit: you measured the atomic operation as receiving 14 ns - assuming the 4 GHz processor runs again up to 56 cycles, which is pretty decent!

+6

Martin B Jul 23 '12 at 8:40

source share

The slowness of any synchronization mechanism cannot be measured with a thread. Single-line synchronization objects, such as POSIX mutexes / Windows critical sections, really require time when they are disputed.

You will need to submit several threads — doing another job that reflects the time of your real application — for synchronized methods to get a real idea of how long it will take.

+1

Puppy Jul 23 '12 at 9:35

source share

interjay · Accepted Answer · 2012-07-23T08:40:18+0000

The answer is that GCC optimizes your non-atomic increments. When he sees a cycle like:

 for (int i=0; i<N; i++) x++;

he replaces it:

 x += N;

This can be seen in the generated assembly, which contains:

 call clock_gettime leal -32(%ebp), %edx addl $1000000, -40(%ebp) <- increment by 1000000 adcl $0, -36(%ebp) movl %edx, 4(%esp) movl $2, (%esp) call clock_gettime

So you are not measuring what you consider yourself.

You can make your variable volatile to prevent this optimization. On my computer, after that, non-atomic access is about 8 times faster than atomic access. When using a 32-bit variable instead of 64-bit (I am compiling 32-bit) the difference is reduced to about 3.

Is it normal that gcc atomic are embedded so slowly?

More articles: