To maximize the benefit of parallelization, the task should be decomposed into coarse-grained pieces of similar size that are independent (or mostly so) and require little data transfer or synchronization between them.
Fine-grained parallelization almost always incurs increased overhead, and its speedup is bounded regardless of the number of available physical cores.
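A minimal sketch of the coarse-grained approach, using Python's standard `multiprocessing` module (the function names `sum_chunk` and `parallel_sum` are illustrative, not from the original text). Each worker receives one large, independent slice of the data, so per-task dispatch overhead stays negligible compared to the work done:

```python
from multiprocessing import Pool

def sum_chunk(chunk):
    # Coarse-grained unit of work: each worker sums a large,
    # independent slice with no communication between workers.
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Split the data into one similarly sized chunk per worker,
    # rather than dispatching one tiny task per element, which
    # would drown the computation in scheduling overhead.
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(n_workers) as pool:
        return sum(pool.map(sum_chunk, chunks))

if __name__ == "__main__":
    # Produces the same result as sum(range(1_000_000)),
    # but the chunks are summed in parallel processes.
    print(parallel_sum(list(range(1_000_000))))
```

Dispatching each element as its own task instead would illustrate the fine-grained extreme: the inter-process communication cost per element would dwarf the single addition it carries.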
[Note that there are architectures with a very large number of cores (e.g., 64,000). They are well suited to computations that can be broken down into relatively simple operations mapped onto a particular topology (for example, a rectangular grid).]