Start with an empty linked list of waiting threads. The head should be initialized to 0 (NULL).
Use CAS (compare-and-swap) to insert a thread at the head of the waiters list. If head == -1, do not insert and do not wait. CAS can safely be used to insert items at the head of a linked list if you do it right.
After inserting itself, the waiting thread should wait for SIGUSR1. Use sigwait() for this.
When ready, the signaling thread uses CAS to set the waitlist head to -1. This prevents any more threads from adding themselves to the waitlist. The signaling thread then iterates over the threads in the waitlist and calls pthread_kill(thread, SIGUSR1) to wake each waiting thread.
If SIGUSR1 was sent before the sigwait() call, sigwait() returns immediately (provided SIGUSR1 is blocked, so the signal stays pending). Thus there is no race between adding a thread to the waitlist and calling sigwait().
EDIT:
Why is CAS faster than a mutex? Disclaimer: I'm a layman here. It is faster for some things in some situations, because it has less overhead when there is no contention. So if you can reduce your concurrent problem down to needing to change 8/16/32/64/128 bits of contiguous memory, and contention will not happen very often, CAS wins. CAS is basically a slightly more fancy/expensive mov instruction, right where you were going to do a plain "mov" anyway. On x86 it's a `lock cmpxchg` or something like that.
A mutex, on the other hand, is a whole bunch of extra stuff that touches other cache lines and uses more memory barriers, etc. (Although CAS itself acts as a memory barrier on x86, x64, etc.) Then, of course, you have to unlock the mutex, which is about the same amount of extra work again.
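To make "a slightly more expensive mov" concrete, here is an uncontended counter incremented both ways. This is my own illustration (the names inc_cas and inc_mutex are mine), using the gcc __sync builtin as the CAS:

```c
#include <pthread.h>

static volatile long    g_counter = 0;
static pthread_mutex_t  g_lock    = PTHREAD_MUTEX_INITIALIZER;

/* CAS path: one lock-prefixed instruction, retried only on an actual race. */
static void inc_cas(void)
{
    long old;
    do {
        old = g_counter;                /* snapshot */
    } while (!__sync_bool_compare_and_swap(&g_counter, old, old + 1));
}

/* Mutex path: a full lock AND unlock even when nobody is contending. */
static void inc_mutex(void)
{
    pthread_mutex_lock(&g_lock);
    g_counter++;
    pthread_mutex_unlock(&g_lock);
}
```

In the uncontended case inc_cas executes its loop body exactly once, while inc_mutex always pays for both the lock and the unlock.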
Here's how you add an item to a linked list using CAS:
while (1)
{
    pOldHead = pHead;                   // <-- snapshot of the world. Start of the race.
    pItem->pNext = pOldHead;
    if (CAS(&pHead, pOldHead, pItem))   // <-- end of the race, if pHead is still pOldHead
        break;                          // success
}
So, how often do you think your code will actually have multiple threads at that CAS line at the same time? In practice... not very often. We ran tests that just looped, adding millions of items with multiple threads at once, and contention occurred less than 1% of the time. In a real program it will hardly ever happen.
Obviously, when a race does occur you have to go back and loop again, but with a linked list, what does that cost you?
The downside is that you cannot do anything very complicated with that linked list if you are going to use this method to add items at the head. Try implementing a doubly linked list this way. What a pain.
EDIT:
In the code above I am using a CAS macro. On Linux, the CAS macro uses __sync_bool_compare_and_swap; see the gcc atomic builtins. On Windows, the CAS macro uses something like InterlockedCompareExchange. Here's what the inline functions might look like on Windows:
inline bool CAS(volatile WORD* p, const WORD nOld, const WORD nNew)
{
    return InterlockedCompareExchange16((short*)p, nNew, nOld) == nOld;
}

inline bool CAS(volatile DWORD* p, const DWORD nOld, const DWORD nNew)
{
    return InterlockedCompareExchange((long*)p, nNew, nOld) == nOld;
}

inline bool CAS(volatile QWORD* p, const QWORD nOld, const QWORD nNew)
{
    return InterlockedCompareExchange64((LONGLONG*)p, nNew, nOld) == nOld;
}

inline bool CAS(void* volatile* p, const void* pOld, const void* pNew)
{
    return InterlockedCompareExchangePointer(p, (PVOID)pNew, (PVOID)pOld) == pOld;
}
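For completeness, here is the Linux/gcc side wired up to the push loop from earlier. This is my own sketch: the struct item layout and the push() wrapper are illustrative, and the macro simply forwards to __sync_bool_compare_and_swap.

```c
#include <stddef.h>

/* Linux/gcc spelling of the CAS macro used in this answer (illustrative). */
#define CAS(p, nOld, nNew) __sync_bool_compare_and_swap((p), (nOld), (nNew))

struct item {
    int          value;
    struct item *pNext;
};

static struct item *pHead = NULL;

/* The lock-free head-insert loop from the answer, using the macro. */
static void push(struct item *pItem)
{
    struct item *pOldHead;
    do {
        pOldHead = pHead;            /* snapshot of the world */
        pItem->pNext = pOldHead;
    } while (!CAS(&pHead, pOldHead, pItem));
}
```

A caller just does push(&someItem); if another thread pushed in between the snapshot and the CAS, the loop retries with a fresh snapshot.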