How to run the same computation faster on a 4-core processor: 4 threads or 50 threads?

Suppose we have a fixed amount of computation with no blocking, sleeping, or waiting for I/O. The work parallelizes very well: it consists of 100M small, independent computation tasks.

Which is faster on a 4-core processor: starting 4 threads or, say, 50? If the second option is slower, why, and by roughly how much?
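One way to answer empirically is to sweep the thread count over the same fixed workload. A minimal sketch in Java (the Math.sqrt loop is a stand-in for the real 100M tasks, and the class name is illustrative; timings are one-shot, so warm up the JVM before serious measurements):

    import java.util.concurrent.*;

    // Times the same fixed CPU-bound workload with different thread counts.
    public class ThreadSweep {
        static final int TASKS = 100_000_000;

        public static void main(String[] args) throws Exception {
            for (int threads : new int[]{1, 2, 4, 8, 16, 50}) {
                ExecutorService pool = Executors.newFixedThreadPool(threads);
                long t0 = System.nanoTime();
                for (int t = 0; t < threads; t++) {
                    final int share = TASKS / threads;
                    pool.submit(() -> {
                        double acc = 0;
                        for (int i = 0; i < share; i++) acc += Math.sqrt(i); // stand-in work
                        return acc; // returning a value makes this a Callable
                    });
                }
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.HOURS);
                System.out.printf("%2d threads: %d ms%n",
                        threads, (System.nanoTime() - t0) / 1_000_000);
            }
        }
    }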

My understanding: when 4 CPU-heavy threads are started on a 4-core processor with no other processes or threads competing, the scheduler is allowed not to migrate the threads between cores at all; it has no reason to in this situation. Core 0 (the boot CPU) will still run the hardware timer interrupt handler 250 times per second (the default Linux configuration) and the other hardware interrupts, but the remaining cores need not be bothered.
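A quick way to experiment with this on Linux is taskset, which restricts a process (and all of its threads) to a given set of cores. For example, taskset -c 0 java Bench forces the whole program onto core 0, so every thread shares that core and the switching costs asked about below become easy to observe (Bench is a hypothetical class name). Watching per-core load in a tool like htop while running unpinned also shows whether the four threads really do stay put.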

What is the cost of a context switch? The time to save and restore the processor registers for the different contexts? What about caches, pipelines, and the various predictors inside the processor? Can we say that every context switch hurts the caches, the pipeline, and the branch prediction machinery in the CPU? So the more threads run on a single core, the less work they can do in total compared with running them sequentially?

How caches and other hardware optimizations behave in a multithreaded environment is the question that interests me most right now.

+7
5 answers

As @Baile mentions in the comments, this is highly application-, system-, and environment-specific.

As such, I am not going to take the hard line of exactly 1 thread per core (or 2 threads/core in the case of Hyperthreading).

As an experienced shared-memory programmer, I have seen from my own experience that the optimal number of threads (for a quad-core machine) can range anywhere from 1 to 64+.

Here are the situations that can cause this range:

Optimal threads < # of cores

In some tasks with very fine-grained parallelism (such as small FFTs), the threading overhead is the dominant performance factor. In some cases it is not useful to parallelize at all. In some cases you get a speedup with 2 threads, but backwards scaling at 4 threads.

Another problem is resource contention. Even if you have a highly parallelizable task that can easily split across 4 cores/threads, you may be bottlenecked by memory bandwidth and cache effects. So often you find that 2 threads will be just as fast as 4 threads. (as is often the case with very large FFTs)
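A hedged way to check whether you are bandwidth-bound rather than compute-bound: have each thread stream over its own slice of an array much larger than the last-level cache, and compare thread counts. If 4 threads are barely faster than 2, memory bandwidth is the limit. A sketch (array size and class name are illustrative; run with a heap of 1 GB or more):

    import java.util.concurrent.atomic.AtomicLong;

    // Each thread streams over its own slice of an array assumed to be
    // much larger than the last-level cache.
    public class BandwidthCheck {
        static final long[] data = new long[64_000_000]; // 512 MB of longs
        static final AtomicLong sink = new AtomicLong(); // defeats dead-code elimination

        public static void main(String[] args) throws Exception {
            for (int n : new int[]{1, 2, 4}) run(n);
        }

        static void run(int threads) throws Exception {
            Thread[] ts = new Thread[threads];
            int slice = data.length / threads;
            long t0 = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                final int lo = t * slice, hi = lo + slice;
                ts[t] = new Thread(() -> {
                    long sum = 0;
                    for (int i = lo; i < hi; i++) sum += data[i]; // pure streaming reads
                    sink.addAndGet(sum); // publish so the JIT cannot discard the loop
                });
                ts[t].start();
            }
            for (Thread th : ts) th.join();
            System.out.println(threads + " threads: " + (System.nanoTime() - t0) / 1_000_000 + " ms");
        }
    }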

Optimal threads = # of cores

This is the best case. No need to elaborate here: one thread per core. Most embarrassingly parallel applications that are not memory- or I/O-bound land right here.
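In code, "one thread per core" usually means asking the runtime for the core count rather than hard-coding 4. A minimal sketch (note that availableProcessors() reports logical cores, so on a Hyperthreaded machine it counts hardware threads, not physical cores):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolPerCore {
        public static void main(String[] args) {
            // One worker thread per available (logical) core.
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);
            System.out.println("pool sized to " + cores + " threads");
            pool.shutdown();
        }
    }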

Optimal threads > # of cores

This is where it gets interesting... very interesting. Have you heard of load imbalance? How about over-decomposition and work stealing?

Many parallelizable applications are irregular, meaning the tasks do not split into sub-tasks of equal size. So if you end up cutting a large task into 4 sub-tasks of unequal size, assigning them to 4 threads, and running them on 4 cores... the result? Poor parallel performance, because 1 thread happened to get 10x more work than the other threads.

The common solution here is to over-decompose the task into many sub-tasks. You can either create a thread for each of them (so now you have threads > cores), or use some kind of task scheduler with a fixed number of threads. Not every task suits both approaches, but quite often over-decomposing a task onto 8 or 16 threads for a 4-core machine gives optimal results, as in the sketch below.
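Java's ForkJoinPool is one such work-stealing task scheduler. A minimal sketch of over-decomposing a job (the summation is a stand-in for real, uneven work; GRAIN and the class name are illustrative):

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Over-decompose a job into many small sub-tasks; idle ForkJoinPool
    // workers steal queued sub-tasks, evening out the load automatically.
    class RangeSum extends RecursiveTask<Long> {
        static final long GRAIN = 1_000_000; // sub-task size: far more tasks than cores
        final long lo, hi;
        RangeSum(long lo, long hi) { this.lo = lo; this.hi = hi; }

        @Override protected Long compute() {
            if (hi - lo <= GRAIN) {
                long sum = 0;
                for (long i = lo; i < hi; i++) sum += i; // stand-in for real work
                return sum;
            }
            long mid = lo + (hi - lo) / 2;
            RangeSum left = new RangeSum(lo, mid);
            left.fork();                                  // queue one half for stealing
            return new RangeSum(mid, hi).compute() + left.join();
        }
    }

Driving it with a pool sized to the core count, e.g. new ForkJoinPool(4).invoke(new RangeSum(0, 100_000_000L)), keeps 4 threads busy while the work itself is cut about a hundred times finer.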


Although spawning more threads can lead to better load balance, the overhead accumulates. So there is an optimal point somewhere. I have seen as many as 64 threads on 4 cores. But as mentioned, it is highly application-specific. And you need to experiment.


EDIT: Expanding the answer to address the questions more directly...

What is the cost of a context switch? The time to save and restore the processor registers for the different contexts?

It is very dependent on the environment, and quite difficult to measure directly.
Short answer: very expensive. This can be a good read.
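For an order-of-magnitude estimate you can ping-pong a token between two threads and time the round trips. This is a sketch, not a precise measurement: each handoff also includes queue and scheduler overhead, so treat the result as an upper bound on the raw switch cost, and pin the JVM to one core (e.g. with taskset -c 0, as above) so each handoff actually forces a context switch:

    import java.util.concurrent.SynchronousQueue;

    // Ping-pong a token between two threads; on a single core each round
    // trip forces at least two thread switches.
    public class SwitchCost {
        public static void main(String[] args) throws Exception {
            SynchronousQueue<Integer> ping = new SynchronousQueue<>();
            SynchronousQueue<Integer> pong = new SynchronousQueue<>();
            final int ROUNDS = 100_000;

            Thread echo = new Thread(() -> {
                try {
                    for (int i = 0; i < ROUNDS; i++) pong.put(ping.take());
                } catch (InterruptedException ignored) {}
            });
            echo.start();

            long t0 = System.nanoTime();
            for (int i = 0; i < ROUNDS; i++) { ping.put(i); pong.take(); }
            long perRoundTrip = (System.nanoTime() - t0) / ROUNDS;
            echo.join();
            System.out.println("ns per round trip (~2 switches): " + perRoundTrip);
        }
    }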

What about caches, pipelines, and the various predictors inside the CPU? Can we say that every time we context switch, we hurt the caches, the pipeline, and some of the instruction decoding machinery in the CPU?

Short answer: yes. When you context switch, you will most likely flush your pipeline and mess up all the predictors. Same with the caches. The new thread is likely to replace the cache contents with new data.

There is a catch, though. In some applications where threads share the same data, it is possible for one thread to "warm" the cache for another incoming thread, or for another thread on a different core sharing the same cache. (Although rare, I have seen this happen before on one of my NUMA machines: super-linear speedup, 17.6x across 16 cores!?!?!)

So, the more threads run on a single core, the less work they can do in total compared with running them sequentially?

It depends, it depends... Thrashing aside, there is definitely overhead. But I once read a paper where someone used a second thread to prefetch data for the main thread... Yes, that is crazy...

+11

If you can use 4 threads, use them. There is no way 50 can go faster than 4 on a 4-core machine; all you get is more overhead.

Of course, you are describing an idealized situation removed from the real world, so whatever you build, you will need to measure to understand how performance is affected.

0

There is Hyperthreading technology, which can run more than one thread per core, but how much it helps depends heavily on the type of computation you want to do. Consider using a GPU, or very low-level assembly language, for maximum performance.

0

Creating 50 threads will actually hurt performance, not improve it. It simply does not make sense.

Ideally you should create 4 threads, no more, no fewer. There will be some overhead from context switching, but that is unavoidable: the OS, services, and other applications' threads must run too. With today's powerful, fast processors that is not a concern, since those OS threads will use less than 2% of the CPU time. Almost all of them will be blocked while your program runs.

You might think that, since performance is critical, you should code those small critical sections in low-level assembly. Modern programming languages allow that.

But seriously... compilers, and in Java's case the JVM, optimize those sections so well that it is simply not worth it (unless you really want to explore something like that). Instead of your calculations finishing in 100 seconds, they finish in 97 or 98. The question you need to ask yourself is: are all those hours of coding and debugging worth it?

You asked about the time cost of context switching. These days it is extremely low. Take a modern dual-core processor running Windows 7, for example. If you start the Apache web server and the MySQL database server on that machine, you will easily go over 800 threads, and the machine simply does not feel it. To see how low this cost is, read here: How to estimate the thread context switching overhead? To spare you the searching/reading: a context switch can be performed hundreds of thousands of times per second.

0

4 threads are faster, unless you can schedule the switching between 40 tasks better than the operating system does.

0
