Using MPI, why will my parallel code be slower than my serial code?

In general, is it possible for parallel code to be slower than serial code? Mine is, and I'm really upset about this! What can I do?

+7
2 answers

Among others, these are three key factors that determine the performance of a parallel application:

• Granularity of the parallel tasks;
• Communication overhead;
• Load balancing among processes.

Granularity of parallel tasks

The granularity of the parallel tasks must be large enough to overcome the overheads of the parallel model (the creation of parallel tasks and the communication among them). Since communication overhead between distributed-memory (DM) processes is usually higher than the cost of thread synchronization, processes should have a coarser task granularity. This granularity should also not compromise load balancing.

tl;dr: Your parallel tasks must be "big enough" to justify the overheads of parallelization.
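As a minimal sketch of how granularity shows up in MPI (the message size N is hypothetical): the same data sent as one coarse-grained message versus many fine-grained ones. The payload is identical, but the fine-grained version pays the per-message latency N times.

```c
#include <mpi.h>

#define N 1000000   /* hypothetical amount of data to transfer */

void send_coarse(double *buf, int dest) {
    /* One large message: the message-setup latency is paid only once. */
    MPI_Send(buf, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

void send_fine(double *buf, int dest) {
    /* N tiny messages: the latency is paid N times -- far too fine-grained. */
    for (int i = 0; i < N; i++)
        MPI_Send(&buf[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}
```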


Communication overhead

Every time one process intends to communicate with others, it incurs the cost of creating and sending the message, and with synchronous communication there is also the cost of waiting for the other processes to receive it. To improve the performance of your MPI application, you need to reduce the number of messages exchanged between processes.

You can use computational redundancy between processes: instead of waiting for a result from one particular process, each process computes that result itself. Of course, this is only justified when the overhead of exchanging the result exceeds the time taken to compute it. Another solution is to replace synchronous communication with asynchronous communication. While in synchronous communication the process that sends the message waits until the other process receives it, in asynchronous communication the sending process resumes execution immediately after returning from the send call, thereby overlapping communication with computation. However, taking advantage of asynchronous communication may require rewriting the application, and it can still be difficult to achieve a good overlap ratio.
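A rough sketch of the asynchronous approach using non-blocking calls; the helpers `do_local_work` and `use_received_data` are hypothetical placeholders for work that is, respectively, independent of and dependent on the received data.

```c
#include <mpi.h>

void do_local_work(void);                      /* hypothetical: work independent of recv_buf */
void use_received_data(double *buf, int n);    /* hypothetical: work that needs recv_buf */

void exchange_and_compute(double *send_buf, double *recv_buf, int count, int neighbor) {
    MPI_Request reqs[2];

    /* Post the receive and the send, returning immediately... */
    MPI_Irecv(recv_buf, count, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, count, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ...then compute something that does not depend on recv_buf while the
       transfer can progress in the background. */
    do_local_work();

    /* Only block when the received data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    use_received_data(recv_buf, count);
}
```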

Communication performance can also be improved by using faster communication hardware, but that can get expensive. Collective communication can improve performance as well, because it optimizes the communication based on the hardware, network, and topology.
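For example, a sketch contrasting a hand-rolled broadcast built from point-to-point sends with the collective `MPI_Bcast`, which is free to use a tree-based, topology-aware algorithm:

```c
#include <mpi.h>

/* Naive broadcast: the root sends to every other rank one by one, O(P) steps at the root. */
void broadcast_naive(double *data, int count, int rank, int size) {
    if (rank == 0) {
        for (int dest = 1; dest < size; dest++)
            MPI_Send(data, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(data, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

/* Collective broadcast: typically O(log P) steps, tuned for the underlying network. */
void broadcast_collective(double *data, int count) {
    MPI_Bcast(data, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}
```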

tl;dr: Reduce the amount of communication and synchronization between parallel tasks. Use: redundant computation, asynchronous communication, collective communication, and faster communication hardware.


Inter-process load balancing

Good load balancing is important because it maximizes the work performed in parallel. Load balancing is affected both by the distribution of tasks among processes and by the set of resources on which the application runs.

In applications that run on a fixed set of resources, you should focus on the task distribution. If the tasks perform roughly the same amount of computation (for example, loop iterations), then it is only necessary to distribute the tasks as evenly as possible among the processes.
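A minimal sketch of such a static, near-even (block) distribution of iterations among processes; the ranges assigned to any two ranks differ by at most one iteration.

```c
/* Compute the half-open iteration range [begin, end) owned by `rank`. */
void my_range(long n, int rank, int size, long *begin, long *end) {
    long base = n / size;     /* iterations every rank gets */
    long rem  = n % size;     /* first `rem` ranks get one extra */
    *begin = rank * base + (rank < rem ? rank : rem);
    *end   = *begin + base + (rank < rem ? 1 : 0);
}

/* usage:
     long b, e;
     my_range(n_iterations, rank, size, &b, &e);
     for (long i = b; i < e; i++) process_iteration(i);   // process_iteration is hypothetical
*/
```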

However, some applications may run on systems with processors of different speeds, or may have subtasks with different amounts of computation. For this kind of situation, to promote better load balancing, you can use the task farming model, since it can be implemented with dynamic task distribution. In this model, however, the amount of communication involved can jeopardize efficiency.
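A hedged sketch of the task-farming (master/worker) model with dynamic task distribution; `do_task` and the integer task encoding are hypothetical, and error handling is omitted.

```c
#include <mpi.h>

#define TAG_WORK 1
#define TAG_STOP 2

double do_task(int task_id);   /* hypothetical per-task computation */

void master(int num_tasks, int size) {
    int next = 0, active = 0;
    double result;
    MPI_Status st;

    /* Seed every worker with an initial task (or a stop if there is none). */
    for (int w = 1; w < size; w++) {
        if (next < num_tasks) {
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            next++; active++;
        } else {
            MPI_Send(&next, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
        }
    }

    /* Hand out the next task as soon as any worker returns a result. */
    while (active > 0) {
        MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK, MPI_COMM_WORLD, &st);
        if (next < num_tasks) {
            MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
            next++;
        } else {
            MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
            active--;
        }
    }
}

void worker(void) {
    int task;
    double result;
    MPI_Status st;

    while (1) {
        MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP)
            break;
        result = do_task(task);
        MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
    }
}
```

Faster processes simply come back for more tasks sooner, so the load balances itself at the cost of one round trip to the master per task.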

Another solution is to tune the task distribution by hand, which can be hard and tricky. However, if the set of resources is not uniform in speed or keeps changing between application runs, the performance portability of a hand-tuned task distribution may be compromised.

tl;dr: Each process should take approximately the same amount of time to finish its work.

+14

As others have pointed out, there are several reasons why parallel code may be slower than serial code.

If you are performing matrix operations, you can make far better use of the processor cache by blocking (tiling) the code. Depending on the cache size, this can improve performance by a factor of 3-4. Blocking essentially processes the matrices in small chunks or blocks that fit in the cache, which reduces reads and writes to main memory and improves performance.
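A minimal sketch of blocking (tiling) applied to matrix multiplication; the tile size `BLOCK` is a hypothetical value that should be tuned to your cache, and `C` is assumed to be zero-initialized.

```c
#define BLOCK 64   /* hypothetical tile size; tune for your cache */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* C += A * B for n x n row-major matrices, processed tile by tile so that
 * the working set of each inner triple loop stays in cache. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int i = ii; i < MIN(ii + BLOCK, n); i++)
                    for (int k = kk; k < MIN(kk + BLOCK, n); k++)
                        for (int j = jj; j < MIN(jj + BLOCK, n); j++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```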

Another option is to use GPUs.

The above solutions work well when the bulk of the computation consists of floating-point or integer operations.

For general-purpose computing, ideally you want your application to adapt: at runtime it should determine whether distributing the workload yields a performance gain, and only distribute it when it is profitable.
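A minimal sketch of that idea; the threshold and the two solver routines are hypothetical placeholders, and in practice the cutoff would be measured or calibrated at runtime rather than hard-coded.

```c
#define DISTRIBUTE_THRESHOLD 100000L                /* hypothetical cutoff */

void solve_serial(long n);                          /* hypothetical serial solver */
void solve_parallel(long n, int rank, int size);    /* hypothetical MPI solver */

void solve(long n, int rank, int size) {
    if (size == 1 || n < DISTRIBUTE_THRESHOLD) {
        /* Problem too small: the communication overhead would dominate. */
        if (rank == 0)
            solve_serial(n);
    } else {
        /* Large enough: distribute the work, compute, and gather the results. */
        solve_parallel(n, rank, size);
    }
}
```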

0
