Cmpxchg for WORD is faster than for BYTE

Yesterday I posted this question about how to write a fast spinlock. Thanks to Corey Nelson I seem to have found a method that outperforms the other methods discussed in my question. I use the CMPXCHG instruction to check whether the lock is 0 and therefore free. CMPXCHG operates on BYTE, WORD and DWORD operands. I would have assumed the instruction runs fastest on BYTE. But I wrote a lock implementing each of the data types:

    inline void spin_lock_8(char* lck)
    {
        __asm
        {
            mov  ebx, lck                   ; move lck pointer into EBX
            xor  cl, cl                     ; set CL to 0
            inc  cl                         ; increment CL to 1
            pause
    spin_loop:
            xor  al, al                     ; set AL to 0
            lock cmpxchg byte ptr [ebx], cl ; compare AL with [ebx]; if equal, ZF is set and CL is stored to [ebx]
            jnz  spin_loop                  ; if ZF is clear the lock was held, so try again
        }
    }

    inline void spin_lock_16(short* lck)
    {
        __asm
        {
            mov  ebx, lck
            xor  cx, cx
            inc  cx
            pause
    spin_loop:
            xor  ax, ax
            lock cmpxchg word ptr [ebx], cx
            jnz  spin_loop
        }
    }

    inline void spin_lock_32(int* lck)
    {
        __asm
        {
            mov  ebx, lck
            xor  ecx, ecx
            inc  ecx
            pause
    spin_loop:
            xor  eax, eax
            lock cmpxchg dword ptr [ebx], ecx
            jnz  spin_loop
        }
    }

    inline void spin_unlock(<anyType>* lck)
    {
        __asm
        {
            mov ebx, lck
            mov <byte/word/dword> ptr [ebx], 0
        }
    }
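For reference, here is roughly the same busy-wait CAS loop written with C++11 std::atomic instead of inline assembly. This is only a portable sketch of the idea, not the code I benchmarked:

    #include <atomic>

    // Spin until we swap 0 -> 1, i.e. until the lock was observed free.
    inline void spin_lock(std::atomic<char>* lck)
    {
        char expected = 0;
        while (!lck->compare_exchange_weak(expected, 1,
                                           std::memory_order_acquire))
        {
            expected = 0; // compare_exchange writes the old value back on failure
        }
    }

    inline void spin_unlock(std::atomic<char>* lck)
    {
        lck->store(0, std::memory_order_release);
    }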

Then the lock was tested using the following pseudocode (note that the lck pointer always points to an address divisible by 4):

    <int/short/char>* lck;

    threadFunc()
    {
        loop 10,000,000 times
        {
            spin_lock_8/16/32(lck);
            spin_unlock(lck);
        }
    }

    main()
    {
        lck = (char/short/int*)_aligned_malloc(4, 4); // ensures memory alignment
        start 1 thread running threadFunc and measure time;
        start 2 threads running threadFunc and measure time;
        start 4 threads running threadFunc and measure time;
        _aligned_free(lck);
    }
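Spelled out concretely (a sketch only; it assumes the spin_lock_8/spin_unlock functions above and simplifies the timing code I actually used):

    #include <chrono>
    #include <cstdio>
    #include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC)
    #include <thread>
    #include <vector>

    char* lck;

    void threadFunc()
    {
        for (int i = 0; i < 10000000; ++i)
        {
            spin_lock_8(lck);
            spin_unlock(lck);
        }
    }

    // Run n contending threads and report the elapsed wall-clock time.
    void measure(int n)
    {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int i = 0; i < n; ++i) threads.emplace_back(threadFunc);
        for (auto& t : threads) t.join();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%d thread(s): %lld ms\n", n, (long long)ms);
    }

    int main()
    {
        lck = (char*)_aligned_malloc(4, 4); // ensures memory alignment
        *lck = 0;
        measure(1);
        measure(2);
        measure(4);
        _aligned_free(lck);
    }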

I got the following results, measured in ms, on a processor with two physical cores capable of running 4 threads (Ivy Bridge).

             1 thread   2 threads   4 threads
    8-bit       200        700        3200
    16-bit      200        500        1400
    32-bit      200        900        3400

The data show that with a single thread all functions take the same amount of time. But as soon as multiple threads are repeatedly checking whether lck == 0, the 16-bit version is much faster. Why is this? I assume it has nothing to do with the alignment of lck?

Thanks in advance.

2 answers

From what I remember, the lock works on a word (2 bytes). It was built that way when it was first introduced on the 486.

If you issue a lock of a different size, it actually generates the equivalent of two locks (locking word A and word B for a double word). For a byte, it probably has to prevent the locking of the second byte of the word, which is again somewhat similar to two locks...

So your results are consistent with how the CPU is optimized.


Imagine 1234 threads and 16 CPUs. One thread acquires a spinlock, then the OS does a task switch. Now you have 16 CPUs each running one of the remaining 1233 threads, all spinning completely pointlessly, because the one thread that can release the spinlock can't do so until the OS gives it CPU time. This means the entire OS can essentially lock up (with all CPUs running flat out) for several seconds. That is seriously bad; so how do you fix it?

You fix it by not using spinlocks in user space. Spinlocks should only be used when/where task switches can be disabled; and only the kernel is able to disable task switches.

Specifically, you need to use a mutex. A mutex may spin briefly before giving up and making the thread wait for the lock, and (for typical/low-contention cases) this does help, but it is then still a mutex and not a spinlock.
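For illustration (my sketch, not code from the question), replacing the spinlock with a standard mutex in the question's test loop is as simple as:

    #include <mutex>

    std::mutex mtx;

    void threadFunc()
    {
        for (int i = 0; i < 10000000; ++i)
        {
            std::lock_guard<std::mutex> guard(mtx); // waits instead of spinning
            // critical section
        }
    }

If the lock holder is pre-empted, waiting threads sleep instead of burning CPU time, which is exactly the property a user-space lock needs.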

Further, for normal software it is important (for performance) to avoid lock contention in the first place, and then to make sure the uncontended case is fast (and a good mutex will not cause a task switch when there is no contention). You are measuring the contended/irrelevant case.

Finally, your lock is bad. To avoid excessive use of the lock prefix you should test whether you might be able to acquire the lock without the lock prefix, and only use the lock prefix when you probably can acquire it. Intel (and probably many others) call this strategy "test; then (test and set)". Also, you have misunderstood the purpose of pause (or "rep nop" for assemblers so bad that they don't support a decade-old instruction).

A half-decent spinlock might look something like this:

    acquire:
        lock bts dword [myLock],0   ;Optimistically attempt to acquire
        jnc .acquired               ;It was acquired!
    .retry:
        pause
        cmp dword [myLock],0        ;Should we attempt to acquire again?
        jne .retry                  ; no, don't use `lock`
        lock bts dword [myLock],0   ;Attempt to acquire
        jc .retry                   ;It wasn't acquired, so go back to waiting
    .acquired:
        ret

    release:
        mov dword [myLock],0        ;No lock prefix needed here as "myLock" is aligned
        ret
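The same "test; then (test and set)" pattern, sketched with C++ atomics for readers who prefer that over assembly. This assumes x86, where _mm_pause from <immintrin.h> emits the pause instruction; it approximates the assembly above rather than reproducing it exactly, since it uses an exchange instead of bts:

    #include <atomic>
    #include <immintrin.h> // _mm_pause (x86-specific)

    std::atomic<int> myLock{0};

    void acquire()
    {
        // Optimistic locked attempt first (like the initial `lock bts`).
        while (myLock.exchange(1, std::memory_order_acquire) != 0)
        {
            // Read-only wait loop: no locked instruction, so no extra
            // bus traffic beyond ordinary cache coherency.
            while (myLock.load(std::memory_order_relaxed) != 0)
                _mm_pause(); // tell the CPU this is a spin-wait loop
        }
    }

    void release()
    {
        myLock.store(0, std::memory_order_release);
    }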

Also note that if you can't adequately minimize the probability of lock contention, then you need to care about "fairness" and should not use a naive spinlock. The problem with "unfair" spinlocks is that some tasks may be lucky and always get the lock, while other tasks may be unlucky and never get it, because the lucky tasks always got there first. This has always been a problem for heavily contended locks, but on modern NUMA systems it has become a much more likely problem. In that case, at a minimum, you should use a ticket lock.

The basic idea of a ticket lock is to ensure that tasks acquire the lock in the order they arrive (and not in some "potentially very bad" random order). For completeness, a ticket lock might look like this:

    acquire:
        mov eax,1
        lock xadd [myLock],eax      ;myTicket = currentTicket, currentTicket++
        cmp [myLock+4],eax          ;Is it my turn?
        je .acquired                ; yes
    .retry:
        pause
        cmp [myLock+4],eax          ;Is it my turn?
        jne .retry                  ; no, wait
    .acquired:
        ret

    release:
        lock inc dword [myLock+4]
        ret
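And an approximate C++ rendering of the ticket lock (again assuming x86 for _mm_pause; the two counters correspond to [myLock] and [myLock+4] in the assembly, and the variable names are my own):

    #include <atomic>
    #include <immintrin.h>

    std::atomic<unsigned> nextTicket{0}; // [myLock]   in the assembly
    std::atomic<unsigned> nowServing{0}; // [myLock+4] in the assembly

    void acquire()
    {
        // Take a ticket (atomic fetch-and-add, like `lock xadd`).
        unsigned ticket = nextTicket.fetch_add(1, std::memory_order_relaxed);
        // Wait until our number comes up; arrival order == acquisition order.
        while (nowServing.load(std::memory_order_acquire) != ticket)
            _mm_pause();
    }

    void release()
    {
        nowServing.fetch_add(1, std::memory_order_release); // like `lock inc`
    }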

tl;dr: You shouldn't be using spinlocks in the first place; but if you insist on using the wrong tool, at least implement the wrong tool properly... :-)

