MPI_Bcast: performance benefits?

Is MPI's MPI_Bcast function purely a convenience, or is there an advantage to using it over simply looping over all the ranks and sending the same message to each of them?

Rationale: MPI_Bcast's behavior of sending the message to everyone, including the root, is inconvenient for me, so I would prefer to avoid it unless there is a good reason to use it, or unless it can be made not to send the message to the root.
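For concreteness, the hand-rolled loop I have in mind is something like this (a minimal sketch; the function name, the int payload, and the count are just placeholders):

    #include <mpi.h>

    /* Root sends the same message to every other rank, one at a time;
       the root itself never touches the receive path. */
    void manual_bcast(int *payload, int count, int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            for (int dst = 0; dst < size; dst++)
                if (dst != root)
                    MPI_Send(payload, count, MPI_INT, dst, 0, comm);
        } else {
            MPI_Recv(payload, count, MPI_INT, root, 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }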

+4
4 answers

Using MPI_Bcast will definitely be more efficient than rolling your own. A great deal of work has gone into all MPI implementations to optimize collective operations based on factors such as message size and communication architecture.

For example, MPI_Bcast in MPICH2 uses a different algorithm depending on the size of the message. For short messages, a binary tree is used to minimize processing load and latency. For long messages, it is implemented as a binary-tree scatter followed by an allgather.
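To illustrate the short-message idea, here is a rough sketch of a tree broadcast built from plain point-to-point calls (a binomial tree, as another answer below mentions). This is only the shape of the algorithm, not MPICH2's actual code:

    #include <mpi.h>

    /* Sketch of a binomial-tree broadcast: each rank receives the message
       once from its parent, then forwards it to its children, so the whole
       broadcast completes in about log2(size) steps instead of size-1. */
    void tree_bcast(void *buf, int count, MPI_Datatype type,
                    int root, MPI_Comm comm)
    {
        int rank, size, mask;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int relrank = (rank - root + size) % size;  /* root becomes 0 */

        /* Receive once from the parent (the lowest set bit of relrank). */
        for (mask = 1; mask < size; mask <<= 1) {
            if (relrank & mask) {
                int src = (relrank - mask + root) % size;
                MPI_Recv(buf, count, type, src, 0, comm, MPI_STATUS_IGNORE);
                break;
            }
        }

        /* Then forward to children at decreasing distances. */
        mask >>= 1;
        for (; mask > 0; mask >>= 1) {
            if (relrank + mask < size) {
                int dst = (relrank + mask + root) % size;
                MPI_Send(buf, count, type, dst, 0, comm);
            }
        }
    }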

In addition, HPC vendors often provide MPI implementations that take advantage of the underlying interconnect, especially for collective operations. For example, they may use hardware multicast support, or bespoke algorithms tailored to the existing interconnect.

+7

Collective communications can be much faster than rolling your own. All MPI implementations spend a great deal of effort making these routines fast.

If you routinely want to do collective-style operations, but only on a subset of tasks, then you probably want to create your own sub-communicators and use BCAST, etc., on those communicators, as in the sketch below.
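A minimal sketch of that approach using MPI_Comm_split (the even/odd split rule here is just an example):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Split MPI_COMM_WORLD into two sub-communicators:
           even ranks get color 0, odd ranks get color 1. */
        MPI_Comm sub;
        MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &sub);

        /* Rank 0 of each sub-communicator broadcasts only to its group. */
        int subrank;
        MPI_Comm_rank(sub, &subrank);
        if (subrank == 0)
            value = rank;  /* some payload chosen by the sub-root */
        MPI_Bcast(&value, 1, MPI_INT, 0, sub);

        MPI_Comm_free(&sub);
        MPI_Finalize();
        return 0;
    }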

+3

MPI_Bcast sends a message from one process (the "root") to all others, by definition. It will probably also be a little faster than just looping over all processes yourself. The mpich2 implementation, for example, uses a binomial tree to distribute the message.

If you do not want to broadcast to all of MPI_COMM_WORLD, but want to define subgroups instead, you can do it like this:

    #include <stdio.h>
    #include "mpi.h"

    #define NPROCS 8

    int main(int argc, char **argv)
    {
        int rank, new_rank, sendbuf, recvbuf;
        int ranks1[4] = {0, 1, 2, 3}, ranks2[4] = {4, 5, 6, 7};
        MPI_Group orig_group, new_group;
        MPI_Comm new_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        sendbuf = rank;

        /* Extract the original group handle */
        MPI_Comm_group(MPI_COMM_WORLD, &orig_group);

        /* Divide tasks into two groups based on rank */
        if (rank < NPROCS/2) {
            MPI_Group_incl(orig_group, NPROCS/2, ranks1, &new_group);
        } else {
            MPI_Group_incl(orig_group, NPROCS/2, ranks2, &new_group);
        }

        /* Create the new communicator and then perform a collective on it.
           Here MPI_Allreduce, but MPI_Bcast works just as well. */
        MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);
        MPI_Allreduce(&sendbuf, &recvbuf, 1, MPI_INT, MPI_SUM, new_comm);

        MPI_Group_rank(new_group, &new_rank);
        printf("rank= %d newrank= %d recvbuf= %d\n", rank, new_rank, recvbuf);

        MPI_Finalize();
        return 0;
    }
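The example assumes exactly NPROCS = 8 MPI processes, so it would typically be built with mpicc and launched with something like mpirun -np 8 (the exact launcher depends on your MPI installation).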

Running it might produce output like the following:

    rank= 7 newrank= 3 recvbuf= 22
    rank= 0 newrank= 0 recvbuf= 6
    rank= 1 newrank= 1 recvbuf= 6
    rank= 2 newrank= 2 recvbuf= 6
    rank= 6 newrank= 2 recvbuf= 22
    rank= 3 newrank= 3 recvbuf= 6
    rank= 4 newrank= 0 recvbuf= 22
    rank= 5 newrank= 1 recvbuf= 22
+2

The answer is that MPI_Bcast is probably faster than a loop in the general case. In general, MPI collectives are optimized over a wide range of message sizes, communicator sizes, and specific rank layouts.

However, it may be possible to beat a collective for certain message sizes, communicator sizes, and rank layouts. For example, a loop over non-blocking point-to-point calls (e.g., Isend and Recv/Irecv) may be faster... but probably only for a few specific combinations of message size, communicator size, and rank layout.
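Such a loop might look like the following sketch (the function name is made up, and whether it actually beats MPI_Bcast depends entirely on those sizes and layouts):

    #include <stdlib.h>
    #include <mpi.h>

    /* Root posts one Isend per receiver, then waits for all of them;
       each receiver posts a single matching Recv. */
    void isend_bcast(int *buf, int count, int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            MPI_Request *reqs = malloc((size - 1) * sizeof *reqs);
            int n = 0;
            for (int dst = 0; dst < size; dst++)
                if (dst != root)
                    MPI_Isend(buf, count, MPI_INT, dst, 0, comm, &reqs[n++]);
            MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
            free(reqs);
        } else {
            MPI_Recv(buf, count, MPI_INT, root, 0, comm, MPI_STATUS_IGNORE);
        }
    }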

If the particular algorithm you are coding needs the Bcast pattern (i.e., all ranks get the same payload from the root), then just use the Bcast collective. In general, it is not worth adding complexity by rolling your own "collective replacements".
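In that case the whole pattern collapses to a single collective call, e.g. for an int payload (wrapper name is just for illustration):

    #include <mpi.h>

    /* Every rank, including the root, makes the same call; the root's
       buffer is the source, and everyone else's buffer is filled in. */
    void bcast_payload(int *payload, int count, int root, MPI_Comm comm)
    {
        MPI_Bcast(payload, count, MPI_INT, root, comm);
    }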

If there is some other message pattern that your algorithm needs, and Bcast is only a partial fit... then it may be worth rolling your own... but personally, I set that bar fairly high.

+2
