Average latency of cmpxchg atomics on Intel CPUs


I am looking for a reference on the average latency of the lock cmpxchg instruction on various Intel processors. I cannot find any good link on this topic, and any link would help a lot.

Thanks.

+7
multithreading x86 atomic lock-free
5 answers

There are few, if any, good references for this, because there is so much variation. It depends on basically everything: bus speed, memory speed, processor speed, number of processors, surrounding instructions, memory fencing, and quite possibly the angle between the moon and Mount Everest...

If you have a very specific application, as in known (fixed) hardware, operating environment, a real-time OS, and exclusive control, then maybe it will matter. In that case, benchmark. If you don't have that level of control over where your software runs, any measurements are effectively meaningless.

As discussed in these answers, locks are implemented using CAS, so if you can get away with a CAS instead of a lock (which requires at least two operations), it will be faster (noticeably so? only maybe).
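For illustration, a minimal C++ sketch of the difference (names are mine, not from any particular library): the lock path pays for an atomic acquire plus a release, while the lock-free path is a single locked CMPXCHG per successful attempt.

 #include <atomic>
 #include <mutex>

 std::mutex mtx;
 long counter_locked = 0;
 std::atomic<long> counter_cas{0};

 // Lock-based path: taking the mutex is itself an atomic RMW, and
 // releasing it is at least an atomic store - two operations minimum.
 void increment_with_lock() {
     std::lock_guard<std::mutex> guard(mtx);
     ++counter_locked;
 }

 // CAS-based path: on x86, compare_exchange_weak compiles to a
 // LOCK CMPXCHG retry loop - one locked operation per successful attempt.
 void increment_with_cas() {
     long expected = counter_cas.load(std::memory_order_relaxed);
     while (!counter_cas.compare_exchange_weak(expected, expected + 1)) {
         // on failure, 'expected' is reloaded with the current value
     }
 }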

The best references you will find are the Intel Software Developer's Manuals, although since there is so much variation, they won't give you an actual number. They will, however, tell you how to get maximum performance. Possibly a processor datasheet (for example, here for the i7 Extreme Edition, in the "Technical Documents" section) will give you actual numbers (or at least a range).

+4

The best reference on x86 instruction latencies is probably Agner Fog's optimization guides, which are based on actual empirical measurements on a variety of Intel/AMD/VIA chips and are frequently updated for the latest CPUs on the market.

Unfortunately, I don't see the CMPXCHG instruction listed in the instruction latency tables, but page 4 does say:

Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices, then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.
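If you want to put a number on "more than a hundred clock cycles" for your own machine, a rough single-threaded timing loop is easy to write. A minimal sketch, assuming GCC or Clang on x86 (the TSC frequency may differ from the core clock, so treat the result as an estimate):

 #include <atomic>
 #include <cstdint>
 #include <cstdio>
 #include <x86intrin.h>  // __rdtsc

 int main() {
     std::atomic<uint64_t> target{0};
     const uint64_t iters = 1000000;

     uint64_t start = __rdtsc();
     for (uint64_t i = 0; i < iters; ++i) {
         uint64_t expected = i;
         target.compare_exchange_strong(expected, i + 1);  // LOCK CMPXCHG on x86
     }
     uint64_t cycles = __rdtsc() - start;

     // Uncontended case only; cross-core contention will be far worse.
     printf("~%.1f TSC cycles per LOCK CMPXCHG\n", (double)cycles / iters);
     return 0;
 }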

+3

I have been working on exponential backoff for several months.

CAS latency depends entirely on whether the instruction can operate on the cache or has to go out to memory. Typically a given memory address is CAS'd by a number of threads (say, a pointer to the write end of a queue). If the last successful CAS was performed by a logical processor that shares a cache with the current CAS executor (L1, L2 or L3, although of course the higher levels are slower), then the instruction operates on the cache and will be fast - a few cycles. If the last successful CAS was performed by a logical core that does not share a cache with the current executor, then the last CASer's write will have invalidated the cache line for the current executor and a memory read is required - this takes hundreds of cycles.

The CAS operation itself is very fast - a few cycles - it is the memory access that is the problem.
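The queue scenario described above looks roughly like this in C++ (a sketch with made-up names): every producer CASes the same shared pointer, so whichever core CASed last holds the cache line, and the next core to try must first pull the line across the interconnect.

 #include <atomic>

 struct Node {
     int   value;
     Node* next;
 };

 // The write end of the queue that all producer threads CAS.
 std::atomic<Node*> write_end{nullptr};

 void push(Node* node) {
     Node* old_end = write_end.load(std::memory_order_relaxed);
     do {
         node->next = old_end;
         // If another core succeeded since our load, the CAS fails,
         // 'old_end' is refreshed, and we retry - the cache-line
         // transfer on that path is where the hundreds of cycles go.
     } while (!write_end.compare_exchange_weak(old_end, node,
                                               std::memory_order_release,
                                               std::memory_order_relaxed));
 }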

+2

I tried to benchmark CAS and DCAS in terms of NOPs.

I have some results, but I don't trust them yet - verification is ongoing.

Currently, on a Core i5 I see 3/5 NOPs for CAS/DCAS. On a Xeon, I see 20/22.

These results may be completely wrong - you have been warned.
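For context, "DCAS" here presumably means a double-width CAS, which on x86-64 is LOCK CMPXCHG16B. A sketch of how it is usually reached from C++ (compile with -mcx16 on GCC/Clang; without it the 16-byte atomic may fall back to a lock in libatomic):

 #include <atomic>
 #include <cstdint>

 // 16-byte, 16-aligned payload: a pointer plus a version tag - a common
 // use of double-width CAS, since the tag sidesteps the ABA problem.
 struct alignas(16) VersionedPtr {
     uint64_t ptr;
     uint64_t version;
 };

 std::atomic<VersionedPtr> slot{VersionedPtr{0, 0}};

 bool dcas_update(uint64_t new_ptr) {
     VersionedPtr expected = slot.load(std::memory_order_relaxed);
     VersionedPtr desired{new_ptr, expected.version + 1};
     // With -mcx16 this compare_exchange compiles to LOCK CMPXCHG16B.
     return slot.compare_exchange_strong(expected, desired);
 }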

0

You can use the AIDA64 software to measure instruction latencies (but you cannot choose which instructions to test; it has a hardcoded list of instructions). People publish the results at http://instlatx64.atw.hu/

Among locked instructions, AIDA64 tests lock add and xchg [mem] (which always locks, even without an explicit LOCK prefix).

Here are some excerpts (L = latency, T = reciprocal throughput, each given in nanoseconds and in clock cycles). For comparison, I also give the latencies of the following non-locked instructions:

  • xchg reg1, reg2, which does not lock;
  • add on registers and memory.

As you can see from the numbers below, locked instructions are only about 5 times slower on Haswell-DT and about 2 times slower on Kaby Lake-S (in nanoseconds) than ordinary non-locked memory stores.

Intel Core i5-4430, 3000 MHz (30 x 100) Haswell-DT

 LOCK ADD [m8], r8         L: 5.96ns = 17.8c   T: 7.21ns = 21.58c
 LOCK ADD [m16], r16       L: 5.96ns = 17.8c   T: 7.21ns = 21.58c
 LOCK ADD [m32], r32       L: 5.96ns = 17.8c   T: 7.21ns = 21.58c
 LOCK ADD [m32 + 8], r32   L: 5.96ns = 17.8c   T: 7.21ns = 21.58c
 LOCK ADD [m64], r64       L: 5.96ns = 17.8c   T: 7.21ns = 21.58c
 LOCK ADD [m64 + 16], r64  L: 5.96ns = 17.8c   T: 7.21ns = 21.58c
 XCHG r8, [m8]             L: 5.96ns = 17.8c   T: 7.21ns = 21.58c
 XCHG r16, [m16]           L: 5.96ns = 17.8c   T: 7.21ns = 21.58c
 XCHG r32, [m32]           L: 5.96ns = 17.8c   T: 7.21ns = 21.58c
 XCHG r64, [m64]           L: 5.96ns = 17.8c   T: 7.21ns = 21.58c
 ADD r32, 0x04000          L: 0.22ns =  0.9c   T: 0.09ns =  0.36c
 ADD r32, 0x08000          L: 0.22ns =  0.9c   T: 0.09ns =  0.36c
 ADD r32, 0x10000          L: 0.22ns =  0.9c   T: 0.09ns =  0.36c
 ADD r32, 0x20000          L: 0.22ns =  0.9c   T: 0.08ns =  0.34c
 ADD r8, r8                L: 0.22ns =  0.9c   T: 0.05ns =  0.23c
 ADD r16, r16              L: 0.22ns =  0.9c   T: 0.07ns =  0.29c
 ADD r32, r32              L: 0.22ns =  0.9c   T: 0.05ns =  0.23c
 ADD r64, r64              L: 0.22ns =  0.9c   T: 0.07ns =  0.29c
 ADD r8, [m8]              L: 1.33ns =  5.6c   T: 0.11ns =  0.47c
 ADD r16, [m16]            L: 1.33ns =  5.6c   T: 0.11ns =  0.47c
 ADD r32, [m32]            L: 1.33ns =  5.6c   T: 0.11ns =  0.47c
 ADD r64, [m64]            L: 1.33ns =  5.6c   T: 0.11ns =  0.47c
 ADD [m8], r8              L: 1.19ns =  5.0c   T: 0.32ns =  1.33c
 ADD [m16], r16            L: 1.19ns =  5.0c   T: 0.21ns =  0.88c
 ADD [m32], r32            L: 1.19ns =  5.0c   T: 0.22ns =  0.92c
 ADD [m32 + 8], r32        L: 1.19ns =  5.0c   T: 0.22ns =  0.92c
 ADD [m64], r64            L: 1.19ns =  5.0c   T: 0.20ns =  0.85c
 ADD [m64 + 16], r64       L: 1.19ns =  5.0c   T: 0.18ns =  0.73c

Intel Core i7-7700K, 4700 MHz (47 x 100) Kaby Lake-S

 LOCK ADD [m8], r8         L: 4.01ns = 16.8c   T: 5.12ns = 21.50c
 LOCK ADD [m16], r16       L: 4.01ns = 16.8c   T: 5.12ns = 21.50c
 LOCK ADD [m32], r32       L: 4.01ns = 16.8c   T: 5.12ns = 21.50c
 LOCK ADD [m32 + 8], r32   L: 4.01ns = 16.8c   T: 5.12ns = 21.50c
 LOCK ADD [m64], r64       L: 4.01ns = 16.8c   T: 5.12ns = 21.50c
 LOCK ADD [m64 + 16], r64  L: 4.01ns = 16.8c   T: 5.12ns = 21.50c
 XCHG r8, [m8]             L: 4.01ns = 16.8c   T: 5.12ns = 21.50c
 XCHG r16, [m16]           L: 4.01ns = 16.8c   T: 5.12ns = 21.50c
 XCHG r32, [m32]           L: 4.01ns = 16.8c   T: 5.20ns = 21.83c
 XCHG r64, [m64]           L: 4.01ns = 16.8c   T: 5.12ns = 21.50c
 ADD r32, 0x04000          L: 0.33ns =  1.0c   T: 0.12ns =  0.36c
 ADD r32, 0x08000          L: 0.31ns =  0.9c   T: 0.12ns =  0.37c
 ADD r32, 0x10000          L: 0.31ns =  0.9c   T: 0.12ns =  0.36c
 ADD r32, 0x20000          L: 0.31ns =  0.9c   T: 0.12ns =  0.36c
 ADD r8, r8                L: 0.31ns =  0.9c   T: 0.11ns =  0.34c
 ADD r16, r16              L: 0.31ns =  0.9c   T: 0.11ns =  0.32c
 ADD r32, r32              L: 0.31ns =  0.9c   T: 0.11ns =  0.34c
 ADD r64, r64              L: 0.31ns =  0.9c   T: 0.10ns =  0.31c
 ADD r8, [m8]              L: 1.87ns =  5.6c   T: 0.16ns =  0.47c
 ADD r16, [m16]            L: 1.87ns =  5.6c   T: 0.16ns =  0.47c
 ADD r32, [m32]            L: 1.87ns =  5.6c   T: 0.16ns =  0.47c
 ADD r64, [m64]            L: 1.87ns =  5.6c   T: 0.16ns =  0.47c
 ADD [m8], r8              L: 1.89ns =  5.7c   T: 0.33ns =  1.00c
 ADD [m16], r16            L: 1.87ns =  5.6c   T: 0.26ns =  0.78c
 ADD [m32], r32            L: 1.87ns =  5.6c   T: 0.28ns =  0.84c
 ADD [m32 + 8], r32        L: 1.89ns =  5.7c   T: 0.26ns =  0.78c
 ADD [m64], r64            L: 1.89ns =  5.7c   T: 0.33ns =  1.00c
 ADD [m64 + 16], r64       L: 1.89ns =  5.7c   T: 0.24ns =  0.73c
0
