How is thread synchronization implemented at the assembly language level?

While I am familiar with concurrent programming concepts such as mutexes and semaphores, I never understood how they are implemented at the assembly level.

I assume that there is a set of memory flags that say:

  • lock A is held by thread 1
  • lock B is held by thread 3
  • lock C is not held by any thread
  • etc.

But how is access to these flags synchronized between threads? Something like this naive example would only create a race condition:

mov edx, [myThreadId]
wait:
cmp [lock], 0
jne wait
mov [lock], edx
; I wanted an exclusive lock but the above
; three instructions are not an atomic operation :(
+22
assembly multithreading synchronization x86 concurrency
Mar 03 '10 at 1:13
3 answers
  • In practice, these are usually implemented with CAS and LL/SC. (... plus some spinning before giving up the thread's time slice - usually by calling into a kernel function that switches context.)
  • If you only need a spinlock, Wikipedia gives you an example that trades CAS for a lock-prefixed xchg on x86/x64. So in a strict sense, CAS is not required to build a spinlock, but some kind of atomicity is still required. In this case, it makes use of an atomic operation that can write a register to memory and return the previous contents of that memory slot in a single step. (To clarify a bit: the lock prefix asserts the #LOCK signal, which ensures that the current CPU has exclusive access to the memory. On today's CPUs it is not necessarily achieved this way, but the effect is the same. By using xchg we make sure that we will not get preempted somewhere between the read and the write, since instructions will not be interrupted halfway. So if we had an imaginary lock mov reg0, mem / lock mov mem, reg1 pair (which we don't), that would not be quite the same - it could be preempted just between the two movs.)
  • On modern architectures, as noted in the comments, you mostly rely on the atomic primitives of the processor and the coherency protocols provided by the memory subsystem.
  • For this reason, you not only have to use these primitives, but also have to take into account the cache/memory coherency guaranteed by the architecture.
  • There may also be implementation nuances. Considering a spinlock, for example:
    • instead of a naive implementation, you should probably use something like a TTAS lock with some exponential backoff (see the sketch after this list),
    • on a Hyper-Threading-capable CPU, you should probably issue pause instructions that serve as hints that you are spinning, so that the core you are running on can do something useful in the meantime,
    • you should really give up spinning and yield control to other threads after a while,
    • etc...
  • this is still user mode though - if you are writing a kernel, you might have a few other tools you can use as well (since you are the one scheduling threads and handling/enabling/disabling interrupts).
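
As an illustration of the TTAS-with-pause idea above, here is a rough sketch in NASM-style 32-bit x86 syntax. The label names (lock_var and friends) are invented for the example, and a real implementation would also add the exponential backoff mentioned above:

; TTAS spinlock sketch. 0 = free, 1 = held.
section .data
lock_var dd 0

section .text
acquire:
.spin:
    cmp dword [lock_var], 0   ; first "test": plain read, stays in the local cache
    jne .relax
    mov eax, 1
    xchg eax, [lock_var]      ; "test-and-set": xchg with memory is implicitly locked
    test eax, eax
    jnz .relax                ; old value was 1, someone else grabbed it first
    ret                       ; old value was 0, the lock is ours
.relax:
    pause                     ; hint to the core that we are spinning
    jmp .spin

release:
    mov dword [lock_var], 0   ; a plain store releases the lock on x86
    ret
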
+20
Mar 03 '10 at 1:17

The x86 architecture has long had an instruction called xchg that will exchange the contents of a register with a memory location. xchg has always been atomic.

There has also always been a lock prefix that can be applied to a single instruction to make that instruction atomic. Before there were multiprocessor systems, all this really did was prevent an interrupt from being delivered in the middle of the locked instruction. (xchg is implicitly locked.)

This article gives sample code using xchg to implement a spinlock: http://en.wikipedia.org/wiki/Spinlock
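
Roughly along those lines, a minimal sketch (NASM-style syntax; not the exact Wikipedia code, and lock_var is just a name made up here):

spin_lock:
    mov eax, 1
.retry:
    xchg eax, [lock_var]      ; atomic swap; the previous value ends up in eax
    test eax, eax
    jnz .retry                ; previous value was 1: already held, keep trying
    ret

spin_unlock:
    mov dword [lock_var], 0
    ret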

As multiprocessor and later multi-core systems began to be developed, more sophisticated protocols were needed to ensure that lock and xchg would synchronize the whole memory subsystem, including the L1 cache on all processors. Around this time, new research into locking and lock-free algorithms showed that an atomic compare-and-set was a more flexible primitive, so more modern processors offer that as an instruction.
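
On x86 that compare-and-set shows up as cmpxchg, usable with the lock prefix. A sketch of trying to take a lock with it, assuming the caller's thread id is in edx and lock_var is the invented lock word from before (0 meaning free):

cas_try_lock:
    xor eax, eax              ; eax = value we expect to find (0 = free)
    lock cmpxchg [lock_var], edx
    ; if [lock_var] was 0, it now contains edx and ZF is set;
    ; otherwise eax now holds the current owner and ZF is clear
    jnz .failed
    mov eax, 1                ; success
    ret
.failed:
    xor eax, eax              ; failure, lock was already held
    ret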

Addendum: Andras's comments provide a "dusty old" list of the instructions that accept the lock prefix: http://pdos.csail.mit.edu/6.828/2007/readings/i386/LOCK.htm

+10
Mar 03 '10 at 2:02

I like to think of thread synchronization as being built bottom up, where the processor and the operating system provide constructs that serve as primitives for the more complex ones.

At the processor level you have CAS and LL/SC, which allow you to perform a test and store in a single atomic operation ... there are also other processor constructs that allow you to disable and enable interrupts (however, they are considered dangerous ... under some circumstances you have no other option but to use them)
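
For the interrupt-disabling variant, a sketch of what that looks like (ring 0 only, and only sufficient on a single-processor machine; on SMP you still need a lock on top of it):

critical_section:
    cli                       ; mask interrupts: no preemption on this CPU
    ; ... touch the shared data here ...
    sti                       ; unmask interrupts again
    ret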

the operating system provides the ability to context switch between tasks, which can happen every time a thread has used up its time slice ... or it can happen due to other reasons (I will come to that later)

then there are higher-level constructs like mutexes, which use these primitive mechanisms provided by the processor (think of a spinning mutex) ... which will keep waiting until a condition becomes true, checking that condition atomically

then these spinning mutexes can use the functionality provided by the OS (context switching and system calls like yield, which gives up control to another thread) and that gives us mutexes
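
A sketch of that spin-then-yield idea, again in NASM-style 32-bit syntax; the syscall number 158 for sched_yield is an assumption about a 32-bit Linux target, and the spin budget of 100 is arbitrary:

lock_with_yield:
    mov ecx, 100              ; spin this many times before yielding
.try:
    mov eax, 1
    xchg eax, [lock_var]
    test eax, eax
    jz .done                  ; got the lock
    pause
    dec ecx
    jnz .try
    mov eax, 158              ; sys_sched_yield on 32-bit Linux (assumed target)
    int 0x80                  ; let the scheduler run another thread
    mov ecx, 100              ; reset the spin budget and try again
    jmp .try
.done:
    ret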

these constructs are then used by higher-level constructs like condition variables (which can keep track of how many threads are waiting on the mutex and which thread to let through first when the mutex becomes available)

These constructs can then be used to build even more complex synchronization constructs ... for example: semaphores, etc.

+2
Mar 03


