Let's look at the implementation of the OpenMPI barrier. Although other implementations may differ in the details, the overall exchange pattern should be similar.
First of all, it should be noted that the MPI barrier has no setup cost: a process that reaches the MPI_Barrier call blocks until all other members of the group have also called MPI_Barrier. Note that MPI does not require them to reach the same call site, only some call to MPI_Barrier. Therefore, since the total number of processes in the group is already known to each process, no additional state is needed to initiate the call.
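To make these semantics concrete, here is a minimal stand-alone sketch (not part of the OpenMPI sources discussed below); the sleep-based stagger and the printed messages are purely illustrative. Each process blocks in MPI_Barrier until every process in the communicator has entered it:

/*
 * Minimal illustration of MPI_Barrier semantics (not OpenMPI code).
 * Build with something like: mpicc barrier_demo.c -o barrier_demo
 */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Simulate uneven progress: higher ranks arrive at the barrier later. */
    sleep(rank);
    printf("rank %d of %d entering the barrier\n", rank, size);

    /* No process leaves this call before all processes have entered it. */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("rank %d leaving the barrier\n", rank);
    MPI_Finalize();
    return 0;
}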
Now let's look at the OpenMPI code itself:
[...]

int
mca_coll_basic_barrier_intra_lin(struct ompi_communicator_t *comm,
                                 mca_coll_base_module_t *module)
{
    int i;
    int err;
    int size = ompi_comm_size(comm);
    int rank = ompi_comm_rank(comm);
First, every node except the one with rank 0 (the root node) sends the root a notification that it has reached the barrier:
    if (rank > 0) {
        err = MCA_PML_CALL(send(NULL, 0, MPI_BYTE, 0,
                                MCA_COLL_BASE_TAG_BARRIER,
                                MCA_PML_BASE_SEND_STANDARD, comm));
        if (MPI_SUCCESS != err) {
            return err;
        }
After that, they block waiting for a notification from the root:
        err = MCA_PML_CALL(recv(NULL, 0, MPI_BYTE, 0,
                                MCA_COLL_BASE_TAG_BARRIER,
                                comm, MPI_STATUS_IGNORE));
        if (MPI_SUCCESS != err) {
            return err;
        }
    }
The root node implements the other side of the exchange. First, it blocks until it has received n-1 notifications (one from every node in the group except itself, since it is already inside the barrier call):
    else {
        for (i = 1; i < size; ++i) {
            err = MCA_PML_CALL(recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE,
                                    MCA_COLL_BASE_TAG_BARRIER,
                                    comm, MPI_STATUS_IGNORE));
            if (MPI_SUCCESS != err) {
                return err;
            }
        }
As soon as all notifications have arrived, the root sends the message each node is waiting for, signaling that everyone has reached the barrier, and then leaves the barrier call itself:
        for (i = 1; i < size; ++i) {
            err = MCA_PML_CALL(send(NULL, 0, MPI_BYTE, i,
                                    MCA_COLL_BASE_TAG_BARRIER,
                                    MCA_PML_BASE_SEND_STANDARD, comm));
            if (MPI_SUCCESS != err) {
                return err;
            }
        }
    }

    return MPI_SUCCESS;
}
Thus, the communication pattern is first n:1 from all nodes to the root, and then 1:n from the root back to all nodes. To avoid overloading the root node with messages, OpenMPI can also use a tree-based communication pattern, but the basic idea is the same: all nodes notify the root when they enter the barrier, and the root aggregates the notifications and tells everyone when it is safe to continue.
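To give a feel for the tree-based variant, here is a rough sketch of the same fan-in/fan-out idea over a binary tree of ranks, written with plain MPI point-to-point calls rather than OpenMPI's internal PML interface; it is not OpenMPI code, and the function name tree_barrier and the tag value are invented for this illustration. Each node first waits for notifications from its children, notifies its parent, then waits for the release from the parent and passes it down:

/* Sketch of a binary-tree barrier (illustrative only, not OpenMPI code). */
#include <mpi.h>

#define BARRIER_TAG 42  /* arbitrary tag chosen for this sketch */

int tree_barrier(MPI_Comm comm)
{
    int rank, size, err;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left   = 2 * rank + 1;
    int right  = 2 * rank + 2;
    int parent = (rank - 1) / 2;

    /* Fan-in: wait for the children that exist... */
    if (left < size) {
        err = MPI_Recv(NULL, 0, MPI_BYTE, left, BARRIER_TAG,
                       comm, MPI_STATUS_IGNORE);
        if (err != MPI_SUCCESS) return err;
    }
    if (right < size) {
        err = MPI_Recv(NULL, 0, MPI_BYTE, right, BARRIER_TAG,
                       comm, MPI_STATUS_IGNORE);
        if (err != MPI_SUCCESS) return err;
    }

    if (rank != 0) {
        /* ...then notify the parent and wait for the release. */
        err = MPI_Send(NULL, 0, MPI_BYTE, parent, BARRIER_TAG, comm);
        if (err != MPI_SUCCESS) return err;
        err = MPI_Recv(NULL, 0, MPI_BYTE, parent, BARRIER_TAG,
                       comm, MPI_STATUS_IGNORE);
        if (err != MPI_SUCCESS) return err;
    }

    /* Fan-out: release the children. */
    if (left < size) {
        err = MPI_Send(NULL, 0, MPI_BYTE, left, BARRIER_TAG, comm);
        if (err != MPI_SUCCESS) return err;
    }
    if (right < size) {
        err = MPI_Send(NULL, 0, MPI_BYTE, right, BARRIER_TAG, comm);
        if (err != MPI_SUCCESS) return err;
    }
    return MPI_SUCCESS;
}

Instead of the root handling all n-1 messages itself, each node only exchanges messages with its parent and children, so the load on any single node stays constant and the number of communication rounds grows logarithmically with the group size.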