The generated unlock code is different. The sequentially consistent (seq_cst) memory ordering (with g++ 4.9.0) generates:
movb %sil, spinLock(%rip)
mfence
for the unlock. Acquire/release generates:
movb %sil, spinLock(%rip)
The lock code is the same. Someone else will have to explain why it is faster with the fence, but if I had to guess, I would say it reduces contention between the bus and the caches, possibly by reducing interference on the bus. Sometimes stricter ordering is more orderly, and therefore faster.
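For reference, here is a minimal sketch of the kind of spinlock being compared. The original class is not shown here, so the names SpinLock and run_counter, the use of std::atomic<bool>, and the template parameter for the unlock ordering are all assumptions made for illustration.

```cpp
#include <atomic>
#include <thread>

// Hypothetical spinlock (not the original code). The unlock ordering is a
// template parameter so the seq_cst and release variants can be compared.
template <std::memory_order UnlockOrder>
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Spin until the previous value was false, i.e. we took the lock.
        while (locked.exchange(true, std::memory_order_acquire)) { }
    }
    void unlock() {
        // The only line that differs between the two variants.
        locked.store(false, UnlockOrder);
    }
};

// Two threads each increment a shared counter `iterations` times under the
// lock; returns the final counter value.
template <std::memory_order UnlockOrder>
long run_counter(long iterations) {
    SpinLock<UnlockOrder> lock;
    long counter = 0;
    auto work = [&] {
        for (long i = 0; i < iterations; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;
}
```

Instantiating run_counter<std::memory_order_seq_cst> and run_counter<std::memory_order_release> and compiling with g++ -O2 -S should let you compare the two unlock paths in the assembly. The exact instructions depend on the compiler version and target.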
ADDENDUM: According to this, mfence costs about 100 cycles. So maybe you are reducing bus contention, because when a thread finishes the loop body, it pauses briefly before trying to reacquire the lock, which lets the other thread finish. You could try to get the same effect by putting a short delay loop after the unlock, although you would need to make sure it does not get optimized away.
ADDENDUM2: This does seem to be caused by bus interference/contention from looping around too fast. I added a short delay loop, for example:
spinLock.unlock();
for (int i = 0; i < 5; i++) {
    j = i * 3.5 + val;
}
Now acquire/release performs the same.
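If the compiler can prove that the result of the delay loop is unused, it is free to delete the loop. One way to guard against that, sketched here under the assumption that a volatile store is acceptable overhead, is to write each result to a volatile variable. The helper name short_delay and its parameters are hypothetical.

```cpp
// Hypothetical helper: burns a few cycles, e.g. right after unlock().
// Each iteration stores to a volatile, so the compiler must keep the loop.
inline double short_delay(int iterations, double val) {
    volatile double sink = 0.0;
    for (int i = 0; i < iterations; ++i) {
        sink = i * 3.5 + val;  // same arithmetic as the delay loop above
    }
    return sink;  // last value written, or 0.0 if the loop never ran
}
```

Calling short_delay(5, val) after spinLock.unlock() would mirror the loop above. How many iterations are needed, if any, is something to measure on the target machine.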