As @Baile points out in the comments, this is very application-, system-, and environment-dependent.
As such, I'm not going to take the hard-line stance of exactly 1 thread per core (or 2 threads per core with Hyper-Threading).
Speaking from my own experience with shared-memory programming, I have seen the optimal number of threads (for a quad-core machine) range anywhere from 1 to 64+.
Now I will enumerate the situations that can cause this range:
Optimal threads < # of cores
In tasks that are very fine-grained parallel (such as small FFTs), the threading overhead is the dominant performance factor. In some cases it is not helpful to parallelize at all. In some cases you get a speedup with two threads, but backwards scaling at 4 threads.
Another issue is resource contention. Even if you have a highly parallelizable task that splits easily across 4 cores/threads, you may be bottlenecked by memory bandwidth and cache effects. So often you find that 2 threads are just as fast as 4 threads. (as is often the case with very large FFTs)
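To make the contention case concrete, here is a minimal sketch of mine (not from the original answer) that sums a large array with 1, 2, and 4 threads. Each thread only streams its own slice of memory, so the bottleneck is bandwidth rather than compute; the array size and thread counts are arbitrary choices for illustration.

```cpp
// Minimal sketch: a memory-bandwidth-bound reduction run with 1, 2 and 4 threads.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 25;   // ~32M doubles (~256 MB), far bigger than cache
    std::vector<double> data(n, 1.0);

    for (unsigned threads : {1u, 2u, 4u}) {
        std::vector<double> partial(threads, 0.0);
        auto t0 = std::chrono::steady_clock::now();

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < threads; ++t) {
            pool.emplace_back([&, t] {
                // Each thread sums a disjoint, contiguous slice of the array.
                std::size_t begin = n / threads * t;
                std::size_t end   = (t == threads - 1) ? n : begin + n / threads;
                partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
            });
        }
        for (auto& th : pool) th.join();

        double sum = std::accumulate(partial.begin(), partial.end(), 0.0);
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%u thread(s): sum = %.0f, %lld ms\n", threads, sum, (long long)ms);
    }
}
```

On a kernel like this, the 2-thread and 4-thread timings often come out close together on a quad-core desktop, which is exactly the "2 threads as fast as 4 threads" effect described above.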
Optimal threads = # of cores
This is the optimal case. No need to explain here: one thread per core. Most embarrassingly parallel applications that are not memory- or I/O-bound fall right here.
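A minimal sketch of this case, under the assumption of a compute-bound, embarrassingly parallel loop: one thread per hardware thread, each owning an equal contiguous slice. The per-element sin() call is just a stand-in for real work.

```cpp
// Minimal sketch: one thread per core over an embarrassingly parallel loop.
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1000000;
    std::vector<double> out(n);

    unsigned threads = std::thread::hardware_concurrency();
    if (threads == 0) threads = 4;               // hardware_concurrency() may return 0

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            // Static split: each thread owns an equal contiguous slice.
            std::size_t begin = n / threads * t;
            std::size_t end   = (t == threads - 1) ? n : begin + n / threads;
            for (std::size_t i = begin; i < end; ++i)
                out[i] = std::sin(0.001 * double(i));
        });
    }
    for (auto& th : pool) th.join();

    std::printf("computed %zu values on %u threads\n", n, threads);
}
```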
Optimal threads > # of cores
This is where it gets interesting... very interesting. Have you heard about load imbalance? How about over-decomposition and work stealing?
Many parallelizable applications are irregular, meaning the work does not split into sub-tasks of equal size. So you may end up splitting a large task into 4 unequal pieces, assigning them to 4 threads, and running them on 4 cores... the result? Poor parallel performance, because one thread happens to have 10x more work than the others.
A common solution here is to over-decompose the task into many small sub-tasks. You can either create a thread for each of them (so now you get threads >> cores), or you can use some sort of task scheduler with a fixed number of threads. Not every task is suitable for both, so quite often the approach of over-decomposing a task into 8 or 16 threads for a 4-core machine gives optimal results. (A sketch of this is shown at the end of this section.)
Although spawning more threads can lead to better load balance, the overhead adds up, so there is an optimal point somewhere. I have seen as many as 64 threads on 4 cores. But as mentioned, it is highly application-specific, and you have to experiment.
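Here is a toy sketch of the over-decomposition-into-more-threads approach (my own illustration with made-up work sizes, not code from any real application). One irregular job of 64 work "atoms", where the first 8 atoms are much heavier than the rest, is split statically across 4, 8, 16, and 64 threads. With 4 threads on 4 cores, one thread inherits nearly all the heavy work; with 16 or 64 threads, the OS scheduler spreads the pieces out and the imbalance largely disappears.

```cpp
// Toy illustration of over-decomposition via thread count (hypothetical work sizes).
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// One atom of irregular work: the first 8 atoms are ~20x heavier than the rest.
static void do_atom(int i) {
    long reps = (i < 8) ? 20000000L : 1000000L;
    volatile double x = 0.0;
    for (long k = 0; k < reps; ++k) x = x + 1e-9 * k;
}

int main() {
    const int atoms = 64;                        // total job is fixed regardless of thread count

    for (int threads : {4, 8, 16, 64}) {         // 4 = one per core; the rest over-decompose
        auto t0 = std::chrono::steady_clock::now();

        std::vector<std::thread> pool;
        for (int t = 0; t < threads; ++t) {
            pool.emplace_back([=] {
                // Static split: thread t owns atoms [begin, end).
                int begin = atoms / threads * t;
                int end   = (t == threads - 1) ? atoms : begin + atoms / threads;
                for (int i = begin; i < end; ++i) do_atom(i);
            });
        }
        for (auto& th : pool) th.join();

        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%2d threads: %lld ms\n", threads, (long long)ms);
    }
}
```

The task-scheduler variant mentioned above would instead keep a fixed pool (say, 4 workers) that claims these atoms dynamically from a queue, which achieves the same balancing without paying for 64 OS threads.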
EDIT: Expanding the answer to address the question more directly...
What is the cost of context switching? The time to save and restore CPU registers for a different context?
This is very dependent on the environment, and it is somewhat difficult to measure directly.
Short answer: very expensive. This can be a good read.
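One rough way to get a number for your own machine is a ping-pong microbenchmark between two threads. The sketch below is my own (not from the original answer): each round trip forces at least two thread wake-ups, and two genuine context switches if you pin both threads to one core (e.g. by running under `taskset -c 0` on Linux). The result mixes scheduler, synchronization, and cache costs, so treat it as an order of magnitude only.

```cpp
// Rough ping-pong microbenchmark: two threads hand a turn flag back and forth.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
    const int rounds = 100000;
    std::mutex m;
    std::condition_variable cv;
    bool ping = true;                            // whose turn it is

    auto player = [&](bool my_turn) {
        for (int i = 0; i < rounds; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return ping == my_turn; });
            ping = !my_turn;                     // hand the turn to the other thread
            cv.notify_one();
        }
    };

    auto t0 = std::chrono::steady_clock::now();
    std::thread a(player, true), b(player, false);
    a.join();
    b.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - t0).count();

    std::printf("~%lld ns per round trip\n", (long long)(ns / rounds));
}
```

On typical desktop systems this lands in the microsecond range per round trip, i.e. thousands of CPU cycles, and that is before counting the indirect cache and predictor damage discussed next.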
What about the caches, pipelines, and various code-prediction facilities inside the CPU? Can we say that each time we switch context, we hurt the caches, pipelines, and some of the code-decoding facilities in the CPU?
Short answer: yes. When you get context-switched out, you are likely to flush your pipeline and mess up all the predictors. Same with the caches: the new thread is likely to replace the cached data with its own.
There is a catch, though. In some applications where the threads share the same data, it is possible for one thread to "warm" the cache for another incoming thread, or for a thread on a different core sharing the same cache. (Although rare, I have seen this happen before on one of my NUMA machines: super-linear speedup, 17.6x across 16 cores!?!?!)
So the more threads that are executed on a single core, the less work they can do together compared to executing them serially?
It depends, it depends... Hyperthreading aside, there will definitely be overhead. But I have read a paper where someone used a second thread to prefetch for the main thread... Yes, it's crazy...
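For illustration only, here is a heavily simplified sketch of that prefetching idea, my own toy reconstruction rather than the paper's code: a helper thread runs a few thousand elements ahead of the main thread and touches the data so it is already in a shared cache when the main thread arrives. This only has a chance of helping when both threads share a cache level and the main thread does enough compute per element.

```cpp
// Toy sketch of a helper thread that "prefetches" data for the main thread.
#include <algorithm>
#include <atomic>
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 24;  // ~16M doubles, well beyond any cache
    const std::size_t lead = 4096;               // how far ahead the helper runs
    std::vector<double> data(n, 1.0);
    std::atomic<std::size_t> progress{0};        // main thread's current index
    std::atomic<bool> done{false};

    // Helper: repeatedly touch data[progress .. progress+lead) to pull it into cache.
    std::thread helper([&] {
        volatile double sink = 0.0;
        while (!done.load(std::memory_order_relaxed)) {
            std::size_t p    = progress.load(std::memory_order_relaxed);
            std::size_t stop = std::min(p + lead, n);
            for (std::size_t i = p; i < stop; i += 8)   // one read per 64-byte cache line
                sink = sink + data[i];
        }
    });

    // Main thread: compute-heavy pass over the array.
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += std::sqrt(data[i]) * std::sin(data[i]);
        if ((i & 1023) == 0)
            progress.store(i, std::memory_order_relaxed);
    }
    done.store(true);
    helper.join();

    std::printf("acc = %f\n", acc);
}
```

If the two threads end up on cores that do not share a cache, or if the main loop is already memory-bound, this does nothing or even hurts, which is part of why it counts as crazy.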