How is a barrier implemented in messaging systems?

I understand that one master process sends a message to all other processes. All other processes in response send a message to the master process. Will this be enough for a barrier to work? If not, what else is needed?

+6
2 answers

Let's look at the Open MPI implementation of the barrier. Other implementations may differ in the details, but the overall exchange pattern should be the same.

First of all, note that an MPI barrier has no setup cost: a process that reaches the MPI_Barrier call blocks until all other members of the group have also called MPI_Barrier. Note that MPI does not require them to reach the same call site, only some call to MPI_Barrier. Since every process already knows the total number of processes in the group, no additional state has to be distributed to start the operation.

Now let's look at some code:

    /*
     * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
     *                         University Research and Technology
     *                         Corporation.  All rights reserved.
     * Copyright (c) 2004-2005 The University of Tennessee and The University
     *                         of Tennessee Research Foundation.  All rights
     *                         reserved.
     * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
     *                         University of Stuttgart.  All rights reserved.
     * Copyright (c) 2004-2005 The Regents of the University of California.
     *                         All rights reserved.
     * Copyright (c) 2012      Oak Ridge National Labs.  All rights reserved.
     * [...]
     */

    [...]

    /*
     *  barrier_intra_lin
     *
     *  Function:   - barrier using O(N) algorithm
     *  Accepts:    - same as MPI_Barrier()
     *  Returns:    - MPI_SUCCESS or error code
     */
    int
    mca_coll_basic_barrier_intra_lin(struct ompi_communicator_t *comm,
                                     mca_coll_base_module_t *module)
    {
        int i;
        int err;
        int size = ompi_comm_size(comm);
        int rank = ompi_comm_rank(comm);

First, every node except the one with rank 0 (the root node) sends the root a notification that it has reached the barrier:

        /* All non-root send & receive zero-length message. */

        if (rank > 0) {
            err = MCA_PML_CALL(send (NULL, 0, MPI_BYTE, 0,
                                     MCA_COLL_BASE_TAG_BARRIER,
                                     MCA_PML_BASE_SEND_STANDARD, comm));
            if (MPI_SUCCESS != err) {
                return err;
            }

After that, they block, waiting for a notification from the root:

            err = MCA_PML_CALL(recv (NULL, 0, MPI_BYTE, 0,
                                     MCA_COLL_BASE_TAG_BARRIER,
                                     comm, MPI_STATUS_IGNORE));
            if (MPI_SUCCESS != err) {
                return err;
            }
        }

The root node implements the other side of the exchange. First, it blocks until it has received n-1 notifications (one from every node in the group except itself, since it is already inside the barrier call):

        else {
            for (i = 1; i < size; ++i) {
                err = MCA_PML_CALL(recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE,
                                        MCA_COLL_BASE_TAG_BARRIER,
                                        comm, MPI_STATUS_IGNORE));
                if (MPI_SUCCESS != err) {
                    return err;
                }
            }

As soon as all notifications have arrived, the root sends the message that each node is waiting for, signalling that everyone has reached the barrier, and then leaves the barrier call itself:

            for (i = 1; i < size; ++i) {
                err = MCA_PML_CALL(send (NULL, 0, MPI_BYTE, i,
                                         MCA_COLL_BASE_TAG_BARRIER,
                                         MCA_PML_BASE_SEND_STANDARD, comm));
                if (MPI_SUCCESS != err) {
                    return err;
                }
            }
        }

        /* All done */

        return MPI_SUCCESS;
    }

Thus, the communication pattern is first n:1, from all nodes to the root, and then 1:n, from the root back to all nodes. To avoid overloading the root node with messages, Open MPI also lets you use a tree-based communication pattern, but the basic idea is the same: all nodes notify the root when they enter the barrier, and the root aggregates the notifications and tells every node once it is safe to continue.
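To put rough numbers on the root-load argument, here is a back-of-the-envelope sketch in plain C (no MPI required). It assumes a binary tree in which rank r has children 2r+1 and 2r+2. That layout and every function name below are illustrative assumptions, not Open MPI's actual tree algorithm.

```c
#include <assert.h>
#include <stdio.h>

/* Messages the root sends or receives in the linear barrier:
 * n-1 arrival notifications in, then n-1 release notifications out. */
int linear_root_messages(int n)
{
    return n > 1 ? 2 * (n - 1) : 0;
}

/* ASSUMPTION (illustrative layout only): rank r has children
 * 2r+1 and 2r+2, as in an array-backed binary heap. */
int tree_children(int rank, int n)
{
    int count = 0;
    if (2 * rank + 1 < n) ++count;
    if (2 * rank + 2 < n) ++count;
    return count;
}

/* In the tree variant the root only talks to its direct children:
 * one notification in and one release out per child. */
int tree_root_messages(int n)
{
    return 2 * tree_children(0, n);
}

/* Number of hops from the deepest rank (n-1) up to the root.
 * A notification travels this far up, and the release this far down. */
int tree_depth(int n)
{
    int depth = 0;
    for (int r = n - 1; r > 0; r = (r - 1) / 2) {
        ++depth;
    }
    return depth;
}

/* Hypothetical helper: print both variants side by side. */
void print_comparison(int n)
{
    printf("ranks=%d linear_root_msgs=%d tree_root_msgs=%d tree_depth=%d\n",
           n, linear_root_messages(n), tree_root_messages(n), tree_depth(n));
}
```

Under these assumptions, with 1024 ranks the linear root handles 2046 messages, while the tree root handles 4 and the tree is 10 levels deep. The total number of messages is the same in both cases (each non-root rank sends one and receives one); the tree only spreads the work across more nodes at the cost of extra hops.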

+7

No, this is not enough. Once the master process has sent a message to all other processes informing them that it has reached the barrier, and all other processes have replied that they too have reached it, only the master process knows that every process has reached the barrier. In this scenario, one more message from the master to the other processes is required.

I make no claim about how MPI barriers are actually implemented in any library. In particular, I am not suggesting that the message sequence described here is used in practice, only that it is insufficient in theory.
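The argument can be checked with a tiny bookkeeping model in C. This is only a sketch: it tracks how many arrivals each process has evidence of after each round of messages. The type and function names are hypothetical and do not correspond to any MPI library.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 64

/* known[r] = number of processes that rank r knows have arrived.
 * Rank 0 plays the master. Hypothetical model, not library code. */
typedef struct {
    int n;
    int known[MAX_PROCS];
} barrier_model_t;

void model_init(barrier_model_t *m, int n)
{
    assert(n >= 1 && n <= MAX_PROCS);
    m->n = n;
    for (int r = 0; r < n; ++r) {
        m->known[r] = 1;            /* each process knows about itself */
    }
}

/* Round 1: master -> workers. Each worker learns the master arrived. */
void master_announces(barrier_model_t *m)
{
    for (int r = 1; r < m->n; ++r) {
        m->known[r] = 2;
    }
}

/* Round 2: workers -> master. Only the master learns about everyone. */
void workers_reply(barrier_model_t *m)
{
    m->known[0] = m->n;
}

/* Round 3: master -> workers. The release tells workers all arrived. */
void master_releases(barrier_model_t *m)
{
    for (int r = 1; r < m->n; ++r) {
        m->known[r] = m->n;
    }
}

/* A process may leave only when it knows that all n have arrived. */
bool can_leave(const barrier_model_t *m, int rank)
{
    return m->known[rank] == m->n;
}

int count_ready(const barrier_model_t *m)
{
    int ready = 0;
    for (int r = 0; r < m->n; ++r) {
        if (can_leave(m, r)) ++ready;
    }
    return ready;
}
```

With four processes, `count_ready` is 1 after `master_announces` and `workers_reply` (only the master may leave) and becomes 4 only after `master_releases`.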

+1
