Deliberately creating more threads than processors is a standard technique for making use of "spare" processor cycles: when a thread blocks waiting for something (I/O, a mutex, or anything else), another thread can give the processor useful work to do.
If your threads do I/O, this is a strong candidate for the speed-up: as each thread blocks waiting for I/O, the processor can run the other threads until they too block on I/O, by which time the data for the first thread is hopefully ready, and so on.
Another possible cause of the speed-up is that your threads are experiencing false sharing. If you have two threads writing to different values in the same cache line (for example, adjacent elements of an array), the CPU stalls while the cache line is bounced back and forth between cores. By adding more threads, you reduce the likelihood that they work on adjacent elements, and thereby reduce the likelihood of false sharing. You can easily test this by adding extra padding to your data elements so that each is at least 64 bytes (a typical cache line size) in size. If your 4-thread code then speeds up, this was the problem.
Anthony Williams