Imagine 1234 threads and 16 CPUs. One thread acquires the spinlock, then the OS does a task switch. Now you have 16 CPUs, each running one of the remaining 1233 threads, all spinning completely pointlessly for however long it takes the OS to give CPU time back to the only thread that can release the spinlock. This means the entire OS can basically lock up (with all CPUs going flat out) for a few seconds. This is seriously broken; so how do you fix it?
You fix it by not using spinlocks in user space. Spinlocks should only be used if/when task switches can be disabled; and only the kernel should be able to disable task switches.
More specifically, you need to use a mutex. Now, the mutex may spin initially before giving up and making the thread wait for the lock, and (for typical/low-contention cases) this does help, but it would still be a mutex and not a spinlock.
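The "spin briefly, then block" behaviour can be sketched in Rust as follows. This is only an illustration under stated assumptions, not how any particular mutex is implemented: `lock_spin_then_block`, `count_with_hybrid_lock` and the `SPIN_LIMIT` value are hypothetical names and numbers chosen for the example, and many real mutex implementations already do some spinning internally, so wrapping them like this is rarely needed in practice.

```rust
use std::hint::spin_loop;
use std::sync::{Mutex, MutexGuard, TryLockError};

/// Hypothetical tuning knob: how many times to poll before blocking.
const SPIN_LIMIT: u32 = 100;

/// Poll the mutex a bounded number of times, then fall back to a blocking
/// `lock()` so the OS can put this thread to sleep instead of burning CPU.
pub fn lock_spin_then_block<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    for _ in 0..SPIN_LIMIT {
        match m.try_lock() {
            Ok(guard) => return guard,
            // Another holder panicked; this sketch just takes the guard anyway.
            Err(TryLockError::Poisoned(p)) => return p.into_inner(),
            // Still held by someone else: hint the CPU and poll again.
            Err(TryLockError::WouldBlock) => spin_loop(),
        }
    }
    // Give up spinning and let the OS block this thread until the lock is free.
    m.lock().unwrap_or_else(|p| p.into_inner())
}

/// Demo helper: `threads` workers each add 1 to a shared total `iters` times.
pub fn count_with_hybrid_lock(threads: usize, iters: u64) -> u64 {
    let total = Mutex::new(0u64);
    std::thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..iters {
                    *lock_spin_then_block(&total) += 1;
                }
            });
        }
    });
    total.into_inner().unwrap()
}
```

The important property is the bounded spin: if the holder has been switched out, waiters stop burning CPU after `SPIN_LIMIT` polls and sleep, which is exactly what a pure spinlock cannot do.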
Next; for sane software, what matters (for performance) is avoiding lock contention and then making sure the uncontended case is fast (and a good mutex won't cause a task switch if there's no contention). You are measuring the contended/irrelevant case.
Finally; your lock is bad. To avoid excessive use of the lock prefix, you should test whether you might be able to acquire without any lock prefix, and only when you might be able to acquire should you use the lock prefix. Intel (and probably lots of other people) call this strategy "test, then (test and set)". Also, you've failed to understand the purpose of pause (or "rep nop" for assemblers so bad that they don't support 10-year-old instructions).
A half-decent spinlock might look something like this:
acquire:
    lock bts dword [myLock],0   ;Optimistically attempt to acquire
    jnc .acquired               ;It was acquired!
.retry:
    pause
    cmp dword [myLock],0        ;Should we attempt to acquire again?
    jne .retry                  ; no, don't use `lock`
    lock bts dword [myLock],0   ;Attempt to acquire
    jc .retry                   ;It wasn't acquired, so go back to waiting
.acquired:
    ret

release:
    mov dword [myLock],0        ;No lock prefix needed here as "myLock" is aligned
    ret
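For readers who do not write assembly, here is a rough portable analogue of the same "test, then test-and-set" idea in Rust. It is a sketch, not a translation of the code above instruction for instruction: `SpinLock` and `count_with_spin_lock` are hypothetical names, the atomic `swap` plays the role of `lock bts`, and `std::hint::spin_loop()` is what emits `pause` on x86.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    pub fn acquire(&self) {
        // Atomic test-and-set; the first pass is the optimistic attempt.
        while self.locked.swap(true, Ordering::Acquire) {
            // Lock is held: wait using plain loads only, so no locked
            // read-modify-write is issued until the lock looks free.
            while self.locked.load(Ordering::Relaxed) {
                spin_loop();
            }
        }
    }

    pub fn release(&self) {
        // A plain release store is enough; no read-modify-write needed.
        self.locked.store(false, Ordering::Release);
    }
}

/// Demo helper: the counter update is a separate load and store, so
/// increments would be lost if the lock failed to exclude other threads.
pub fn count_with_spin_lock(threads: usize, iters: u64) -> u64 {
    let lock = SpinLock::new();
    let counter = AtomicU64::new(0);
    std::thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..iters {
                    lock.acquire();
                    let v = counter.load(Ordering::Relaxed);
                    counter.store(v + 1, Ordering::Relaxed);
                    lock.release();
                }
            });
        }
    });
    counter.into_inner()
}
```

Everything said earlier still applies: this runs in user space, so a holder that gets switched out leaves every waiter spinning.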
Also note that if you have failed to adequately minimise the chance of lock contention, then you need to care about “fairness” and should not be using a spinlock. The problem with “unfair” spinlocks is that some tasks may be lucky and always get the lock, and some tasks may be unlucky and never get the lock, because the lucky tasks always got it first. This has always been a problem for heavily contended locks, but on modern NUMA systems it has become a much more likely problem. In this case, at a minimum, you should use a ticket lock.
The basic idea of a ticket lock is to ensure that tasks acquire the lock in the order in which they arrive (and not in some “possibly extremely bad” random order). For completeness, a ticket lock might look like this:
acquire:
    mov eax,1
    lock xadd [myLock],eax      ;myTicket = currentTicket, currentTicket++

    cmp [myLock+4],eax          ;Is it my turn?
    je .acquired                ; yes
.retry:
    pause
    cmp [myLock+4],eax          ;Is it my turn?
    jne .retry                  ; no, wait
.acquired:
    ret

release:
    lock inc dword [myLock+4]
    ret
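The same ticket scheme can be sketched in Rust. Again this is an illustrative analogue with hypothetical names (`TicketLock`, `next_ticket`, `now_serving`), not the code above compiled differently: `fetch_add` stands in for `lock xadd`, and the two counters correspond to the dwords at `[myLock]` and `[myLock+4]`. Returning the ticket number from `acquire` is an addition made here purely so the ordering can be observed.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicU32, Ordering};

pub struct TicketLock {
    next_ticket: AtomicU32, // next ticket to hand out, like [myLock]
    now_serving: AtomicU32, // ticket allowed to proceed, like [myLock+4]
}

impl TicketLock {
    pub const fn new() -> Self {
        TicketLock {
            next_ticket: AtomicU32::new(0),
            now_serving: AtomicU32::new(0),
        }
    }

    /// Takes a ticket, spins until it is that ticket's turn, and returns
    /// the ticket number (returned only to make the order visible).
    pub fn acquire(&self) -> u32 {
        // myTicket = next_ticket, next_ticket++ (wraps on overflow).
        let my_ticket = self.next_ticket.fetch_add(1, Ordering::Relaxed);
        // Wait with plain loads until our number comes up.
        while self.now_serving.load(Ordering::Acquire) != my_ticket {
            spin_loop();
        }
        my_ticket
    }

    /// Hands the lock to whoever holds the next ticket.
    pub fn release(&self) {
        self.now_serving.fetch_add(1, Ordering::Release);
    }
}
```

Because each waiter spins on its own ticket number, the lock is granted strictly in arrival order, which is the fairness property the plain spinlock lacks.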
tl;dr: You shouldn't be using the wrong tool for the job (spinlocks) to begin with; but if you insist on using the wrong tool, at least get the wrong tool implemented properly... :-)