Why are there more threads than cores?

I implemented the PageRank version in a multi-threaded version. I run it on a 4-core Q6600. When I run it to create 4 threads, I get:

real 6.968s user 26.020s sys 0.050s 

When I start 128 threads, I get:

 real 0.545s user 1.330s sys 0.040s 

It makes no sense to me. The main algorithm is to reduce the amount:

  • All threads summarize a subset of input;
  • Synchronize;
  • Each thread then accumulates part of the results from other threads;
  • The main thread sums the intermediate value from all threads, and then determines whether to continue.

Profiling did not help. I'm not sure what data will be useful for understanding my code - just ask.

It really puzzles me.

+7
source share
2 answers

Deliberately creating more threads than processors is the standard technique used to use โ€œspare loopsโ€ when a thread blocks waiting for something, be it I / O, mutex, or something else, providing some other useful work for processor to do.

If your threads do I / O, this is a strong competitor for speeding up: since each thread blocks waiting for I / O, the processor can start other threads until they block I / O too much, I hope that by this time the data for the first thread is ready and etc.

Another possible cause of acceleration is that your threads experience false separation . If you have two streams that write data to different values โ€‹โ€‹in the same cache line (for example, adjacent elements of the array), this blocks the CPU while the cache line is being sent back and forth. By adding more threads, you reduce the likelihood that they work on neighboring elements, and thereby reduce the likelihood of a false exchange. You can easily test this by adding an extra addition to your data elements so that they are at least 64 bytes in size (typical cache line size). If your 4-thread code is speeding up, this is the problem.

+10
source

You probably have spare CPU cycles while the thread is blocking some resources, such as memory. These processor cycles may be used by other threads. The data I would look at is ... Does the 4-thread version show 100% utilization of each core? If you did not find spare processor cycles.

+6
source

All Articles