Why does my spinlock implementation perform worse when I use a non-seq_cst memory order?

I have two versions of the spinlock below. The first uses the default ordering, which is memory_order_seq_cst, while the second uses memory_order_acquire / memory_order_release. Since the latter is more relaxed, I expected it to perform better. However, this does not seem to be the case.

    #include <atomic>
    #include <cassert>
    #include <iostream>
    #include <thread>
    #include <vector>

    using namespace std;

    // Backoff helper used while the lock is contended. Its definition is not
    // shown in this post; the by-reference signature is assumed from the call site.
    void DoWaitBackoff(int& backoff);

    class SimpleSpinLock
    {
    public:
        inline SimpleSpinLock() : mFlag(ATOMIC_FLAG_INIT) {}

        inline void lock()
        {
            int backoff = 0;
            while (mFlag.test_and_set())        // defaults to memory_order_seq_cst
            {
                DoWaitBackoff(backoff);
            }
        }

        inline void unlock() { mFlag.clear(); } // defaults to memory_order_seq_cst

    private:
        std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
    };

    class SimpleSpinLock2
    {
    public:
        inline SimpleSpinLock2() : mFlag(ATOMIC_FLAG_INIT) {}

        inline void lock()
        {
            int backoff = 0;
            while (mFlag.test_and_set(std::memory_order_acquire))
            {
                DoWaitBackoff(backoff);
            }
        }

        inline void unlock() { mFlag.clear(std::memory_order_release); }

    private:
        std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
    };

    const int NUM_THREADS  = 8;
    const int NUM_ITERS    = 5000000;
    const int EXPECTED_VAL = NUM_THREADS * NUM_ITERS;

    int val = 0;
    long j  = 0;

    SimpleSpinLock spinLock;

    void ThreadBody()
    {
        for (int i = 0; i < NUM_ITERS; ++i)
        {
            spinLock.lock();
            ++val;
            j = i * 3.5 + val;
            spinLock.unlock();
        }
    }

    int main()
    {
        vector<thread> threads;

        for (int i = 0; i < NUM_THREADS; ++i)
        {
            cout << "Creating thread " << i << endl;
            threads.push_back(std::move(std::thread(ThreadBody)));
        }

        for (thread& thr : threads)
        {
            thr.join();
        }

        cout << "Final value: " << val << "\t" << j << endl;
        assert(val == EXPECTED_VAL);

        return 0;
    }
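(The definition of DoWaitBackoff is not included in the post. Purely as a sketch of what such an exponential-backoff helper might look like — the internals below are an assumption, not the poster's code:)

    #include <thread>

    // Hypothetical backoff helper: spin for an exponentially growing number of
    // iterations on early attempts, then start yielding to other threads.
    void DoWaitBackoff(int& backoff)
    {
        if (backoff < 10)
        {
            ++backoff;
            for (int i = 0; i < (1 << backoff); ++i)
            {
                // A volatile empty asm statement keeps -O3 from deleting the
                // spin loop (GCC/x86 assumption).
                asm volatile("" ::: "memory");
            }
        }
        else
        {
            // After enough failed attempts, give other threads a chance to run.
            std::this_thread::yield();
        }
    }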

I am working on Ubuntu 12.04 with gcc 4.8.2, compiling with -O3 optimization.

- Spinlock with memory_order_seq_cst:

    Run 1: real 0m1.588s  user 0m4.548s  sys 0m0.052s
    Run 2: real 0m1.577s  user 0m4.580s  sys 0m0.032s
    Run 3: real 0m1.560s  user 0m4.436s  sys 0m0.032s

- Spinlock with memory_order_acquire / release:

    Run 1: real 0m1.797s  user 0m4.608s  sys 0m0.100s
    Run 2: real 0m1.853s  user 0m4.692s  sys 0m0.164s
    Run 3: real 0m1.784s  user 0m4.552s  sys 0m0.124s
    Run 4: real 0m1.475s  user 0m3.596s  sys 0m0.120s

With the more relaxed model I see much greater variability: sometimes it is better, but it is often worse. Does anyone have an explanation?

1 answer

The generated unlock code is different. The seq_cst memory model (with g++ 4.9.0) generates:

    movb   %sil, spinLock(%rip)
    mfence

for the unlock. Acquire / release generates:

  movb %sil, spinLock(%rip) 

The lock code is the same. Someone else will have to explain why it is better with the fence, but if I had to guess, I would guess that it reduces bus / cache-coherency contention, possibly by reducing interference on the bus. The stronger ordering is sometimes, counterintuitively, faster.
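If the mfence itself is what helps, one way to check that theory (a sketch, not part of the original answer; the gFlag variable and function name are placeholders) is to keep the release-ordered clear() but add an explicit full fence after it, which GCC on x86 typically compiles back to the same movb followed by mfence:

    #include <atomic>

    std::atomic_flag gFlag = ATOMIC_FLAG_INIT;

    // Release the lock with a plain store, then issue an explicit full fence.
    // With GCC on x86 the fence is typically emitted as an mfence instruction,
    // reproducing the seq_cst unlock shown above.
    inline void UnlockWithExplicitFence()
    {
        gFlag.clear(std::memory_order_release);
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }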

ADDENDUM: According to this, an mfence costs about 100 cycles. So perhaps you are reducing bus contention: when a thread finishes the loop body, it stalls briefly before trying to reacquire the lock, which lets the other thread finish. You could try to get the same effect by putting a short delay loop after the unlock, though you would have to make sure it does not get optimized away.

ADDENDUM 2: It does seem to be bus contention / interference caused by looping too quickly. I added a short delay loop, for example:

    spinLock.unlock();
    for (int i = 0; i < 5; i++)
    {
        j = i * 3.5 + val;
    }

Now the acquire / release version performs the same as seq_cst.
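(A hedged alternative for that post-unlock delay, not from the original answer: on x86 the pause instruction gives a short stall that the compiler will not optimize away and that does not touch any shared data. The helper name and iteration count below are assumptions.)

    #include <immintrin.h>   // _mm_pause (x86 only)

    // Hypothetical post-unlock delay: a handful of pause instructions.
    // _mm_pause is treated as having side effects, so -O3 keeps the loop.
    inline void PostUnlockDelay()
    {
        for (int i = 0; i < 5; ++i)
        {
            _mm_pause();
        }
    }

    // Usage inside ThreadBody:
    //     spinLock.unlock();
    //     PostUnlockDelay();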
