In a library that uses Monitor.PulseAll() for thread synchronization, I noticed that the latency from the time PulseAll(...) is called to the time a thread is woken up seems to follow a "stair step" distribution, with extremely large steps. The woken threads do almost no work and almost immediately go back to waiting on the monitor. For example, on a box with 12 cores and 24 threads waiting on a monitor (2x Xeon5680 / Gulftown; 6 physical cores per processor; HT disabled), the latency between the pulse and a thread waking up is as follows:

The first 12 threads (note that we have 12 cores) take between 30 and 60 microseconds to respond. After that we start to get very large jumps, with plateaus at around 700, 1300, 1900, and 2600 microseconds.
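One way to make the plateaus explicit is to split the sorted latencies wherever two neighbouring values are further apart than some threshold. The following is only a sketch: `PlateauFinder`, `Group`, and `gapThreshold` are hypothetical names that are not part of the test program, and the sample values in `Main` are placeholders chosen to resemble the figures quoted above, not measurements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PlateauFinder
{
    // Sorts the latencies and starts a new group whenever the gap to the
    // previous value exceeds gapThreshold (same unit as the samples).
    public static List<List<double>> Group(IEnumerable<double> latencies, double gapThreshold)
    {
        var groups = new List<List<double>>();
        foreach (double value in latencies.OrderBy(v => v))
        {
            bool startNewGroup = groups.Count == 0
                || value - groups[groups.Count - 1].Last() > gapThreshold;
            if (startNewGroup)
            {
                groups.Add(new List<double>());
            }
            groups[groups.Count - 1].Add(value);
        }
        return groups;
    }
}

static class Demo
{
    static void Main()
    {
        // Placeholder values in microseconds, NOT real measurements.
        double[] sample = { 35, 50, 58, 690, 710, 1290, 1310 };

        foreach (List<double> plateau in PlateauFinder.Group(sample, 200))
        {
            Console.WriteLine("{0} threads between {1} and {2} us",
                plateau.Count, plateau.First(), plateau.Last());
        }
    }
}
```

Feeding it the per-iteration latencies collected by the test program (converted to microseconds) would give the number of threads on each step, which can then be compared against the core count.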
I was able to recreate this behavior independently of the third-party library using the code below. The code launches a large number of threads (change the numThreads parameter) that simply wait on a monitor, read a timestamp, log it to a ConcurrentBag, and then immediately go back to waiting. Once a second, PulseAll() wakes up all the threads. It does this 20 times and reports the latencies for the 10th iteration to the console.
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading;
    using System.Threading.Tasks;
    using System.Collections.Concurrent;
    using System.Diagnostics;

    namespace PulseAllTest
    {
        class Program
        {
            static long LastTimestamp;
            static long Iteration;
            static object SyncObj = new object();
            static Stopwatch s = new Stopwatch();
            static ConcurrentBag<Tuple<long, long>> IterationToTicks = new ConcurrentBag<Tuple<long, long>>();

            static void Main(string[] args)
            {
                long numThreads = 32;

                for (int i = 0; i < numThreads; ++i)
                {
                    Task.Factory.StartNew(ReadLastTimestampAndPublish, TaskCreationOptions.LongRunning);
                }

                s.Start();
                for (int i = 0; i < 20; ++i)
                {
                    lock (SyncObj)
                    {
                        ++Iteration;
                        LastTimestamp = s.Elapsed.Ticks;
                        Monitor.PulseAll(SyncObj);
                    }
                    Thread.Sleep(TimeSpan.FromSeconds(1));
                }

                Console.WriteLine(String.Join("\n",
                    from n in IterationToTicks
                    where n.Item1 == 10
                    orderby n.Item2
                    select ((decimal)n.Item2) / TimeSpan.TicksPerMillisecond));
                Console.Read();
            }

            static void ReadLastTimestampAndPublish()
            {
                while (true)
                {
                    lock (SyncObj)
                    {
                        Monitor.Wait(SyncObj);
                    }
                    IterationToTicks.Add(Tuple.Create(Iteration, s.Elapsed.Ticks - LastTimestamp));
                }
            }
        }
    }
Using the code above, here is an example of the latencies on a box with 8 cores with hyperthreading enabled (i.e. 16 cores in Task Manager) and 32 threads (2x Xeon5550 / Gainestown; 4 physical cores per processor; HT enabled):

EDIT: To try to take NUMA out of the equation, the following is a graph from running the sample program with 16 threads on a Core i7-3770 (Ivy Bridge); 4 physical cores; HT enabled:

Can anyone explain why Monitor.PulseAll() behaves this way?
EDIT2:
To try to show that this behavior is not inherent to waking up a bunch of threads at the same time, I replicated the behavior of the test program using events; instead of measuring the latency of PulseAll(), I measure the latency of ManualResetEvent.Set(). The code creates a number of worker threads, which then wait for ManualResetEvent.Set() on the same shared ManualResetEvent object. When the event is triggered, they take a latency measurement and then immediately wait on their own individual per-thread AutoResetEvent. Well before the next iteration (500 ms before), the ManualResetEvent is Reset(), and then each AutoResetEvent is Set(), so the threads can go back to waiting on the shared ManualResetEvent.
I hesitated to post this because it could be a giant red herring (I make no claims that events and monitors behave similarly), and it uses some absolutely terrible practices to make an event behave like a monitor (I would love/hate to see what my co-workers would do if I submitted this to a code review); but I think the results are enlightening.
This test was run on the same machine as the original test; 2xXeon5680 / Gulftown; 6 cores per processor (total 12 cores); Hyperthreading is disabled.

In case it is not clear how radically different this is from Monitor.PulseAll(), here is the first graph superimposed on the last graph:

The code used to create these measurements is shown below:
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading;
    using System.Threading.Tasks;
    using System.Collections.Concurrent;
    using System.Diagnostics;

    namespace MRETest
    {
        class Program
        {
            static long LastTimestamp;
            static long Iteration;
            static ManualResetEventSlim MRES = new ManualResetEventSlim(false);
            static List<ReadLastTimestampAndPublish> Publishers = new List<ReadLastTimestampAndPublish>();
            static Stopwatch s = new Stopwatch();
            static ConcurrentBag<Tuple<long, long>> IterationToTicks = new ConcurrentBag<Tuple<long, long>>();

            static void Main(string[] args)
            {
                long numThreads = 24;
                s.Start();

                for (int i = 0; i < numThreads; ++i)
                {
                    // Each worker gets its own AutoResetEvent to park on after measuring.
                    AutoResetEvent ares = new AutoResetEvent(false);
                    ReadLastTimestampAndPublish spinner = new ReadLastTimestampAndPublish(ares);
                    Task.Factory.StartNew(spinner.Spin, TaskCreationOptions.LongRunning);
                    Publishers.Add(spinner);
                }

                for (int i = 0; i < 20; ++i)
                {
                    ++Iteration;
                    LastTimestamp = s.Elapsed.Ticks;
                    MRES.Set();          // wake every worker at once
                    Thread.Sleep(500);
                    MRES.Reset();        // close the gate before releasing the workers
                    foreach (ReadLastTimestampAndPublish publisher in Publishers)
                    {
                        publisher.ARES.Set();
                    }
                    Thread.Sleep(500);
                }

                Console.WriteLine(String.Join("\n",
                    from n in IterationToTicks
                    where n.Item1 == 10
                    orderby n.Item2
                    select ((decimal)n.Item2) / TimeSpan.TicksPerMillisecond));
                Console.Read();
            }

            class ReadLastTimestampAndPublish
            {
                public AutoResetEvent ARES { get; private set; }

                public ReadLastTimestampAndPublish(AutoResetEvent ares)
                {
                    this.ARES = ares;
                }

                public void Spin()
                {
                    while (true)
                    {
                        MRES.Wait();
                        IterationToTicks.Add(Tuple.Create(Iteration, s.Elapsed.Ticks - LastTimestamp));
                        ARES.WaitOne();  // park until the main thread has reset MRES
                    }
                }
            }
        }
    }