The answer is that MPI_Bcast is probably faster than a loop, in the general case. MPI collectives are optimized across a wide range of message sizes, communicator sizes, and rank layouts.
However, it is possible to beat a collective for certain specific message sizes, communicator sizes, and rank layouts. For example, a loop of non-blocking point-to-point calls (e.g. MPI_Isend paired with MPI_Recv / MPI_Irecv) may be faster ... but probably only for a few specific combinations of those parameters.
If the particular algorithm you are coding needs a broadcast pattern (i.e., every rank gets the same payload from the root), then use the MPI_Bcast collective. In general, you should not add complexity by implementing your own "collective replacements".
If the algorithm needs some other message pattern, and MPI_Bcast is only a partial fit ... then it may be worth rolling your own ... but personally I set that bar quite high.