MPI Internal Interfaces: Implementing Inter-Process Communication

I am trying to understand how processes actually communicate inside an MPI communicator. I have 8 nodes, each with 12 cores, so 96 copies of the program run. Each process has a unique rank, and the processes can talk to one another. So how do the processes obtain their unique ranks, and how do they actually send messages to each other?

According to some slides, there is an Open Run-Time Environment (ORTE) which "resides on the machine where the processes of a cell are launched (e.g., the front-end node of the cluster); is responsible for launching the processes of the cell (nodes, processes); relays the state of the cell to the rest of the universe; routes communication between cells." I was unable to find any developer documentation or architecture documents for the MPI implementations.

Does anyone have an idea how the actual connection between MPI processes is established, i.e., how they manage to find each other and get their ranks assigned? Is there a central MPI-internal process, or several of them (e.g., one per node), that does the routing?
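
For concreteness, here is a minimal sketch of the kind of program I mean (the payload value is just a placeholder): every copy somehow ends up with its own rank and can address messages to other ranks, and I want to understand what happens behind MPI_Init and MPI_Send here:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* where does the magic happen? */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* who am I?                    */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many of us are there?    */

        if (rank == 0) {
            int payload = 42;
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int payload;
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0 (of %d ranks)\n",
                   payload, size);
        }

        MPI_Finalize();
        return 0;
    }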

Thanks, David

1 answer

The mechanisms you are asking about are strictly implementation dependent. MPI is a mid-level standard that sits on top of whatever communication mechanisms the hardware and the operating system provide.

ORTE is part of Open MPI, one of the general-purpose MPI implementations in widespread use today. There are also MPICH and MPICH2 and their derivatives (e.g., Intel MPI). Most supercomputer vendors provide their own MPI implementations (e.g., IBM supplies a modified MPICH2 for Blue Gene/Q).

Open MPI is structured as several layers, and the functionality of each layer is provided by many dynamically loadable modules (this is its Modular Component Architecture, MCA). A scoring mechanism selects the module considered best under the given conditions.
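
As an illustration (a sketch, not authoritative: it relies on Open MPI reading MCA parameters from OMPI_MCA_* environment variables during initialization, and the component names vary between versions), module selection can be influenced from outside. The usual way is the --mca option of mpirun, e.g. mpirun --mca btl self,sm,tcp ./app, but the same parameter can also be set from within the program before MPI_Init:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;

        /* Illustrative: restrict Open MPI's point-to-point byte
           transfer layer (btl) to the loopback (self), shared-memory
           (sm) and TCP modules. Open MPI picks up MCA parameters from
           OMPI_MCA_* environment variables during MPI_Init, so this
           must run before MPI_Init; component names differ between
           Open MPI versions. */
        setenv("OMPI_MCA_btl", "self,sm,tcp", 1);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d initialized\n", rank);
        MPI_Finalize();
        return 0;
    }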

All MPI implementations provide a launcher for so-called SPMD jobs. An MPI application is, in fact, a special kind of SPMD (Single Program Multiple Data) program: many copies of a single executable are run, and message passing is used as the mechanism for communication and coordination between them. It is the SPMD launcher that takes a list of execution nodes, starts the processes remotely, and establishes the connection and communication scheme between them (in Open MPI this is called the MPI Universe). It is also what creates the global MPI communicator MPI_COMM_WORLD and performs the initial rank assignment, and it can provide options such as binding processes to CPU cores (which is very important on NUMA systems).

Once the processes are running, some identification mechanism is available (e.g., a mapping between ranks and IP address/TCP port pairs), after which other addressing schemes can be used. Open MPI, for example, starts remote processes using ssh or rsh, or uses the mechanisms provided by various resource management systems (e.g., PBS/Torque, SLURM, Grid Engine, LSF). After the processes have started, their IP addresses and port numbers are collected and broadcast throughout the Universe; the processes can then find each other over other (faster) networks, e.g., InfiniBand, and establish communication routes over them.
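
To observe the result of the launcher's work, every process can report the rank it was given and the node it was placed on. A minimal sketch (the command line is only an example; in Open MPI the 8 x 12 layout above could be started with something like mpirun -np 96 -hostfile hosts ./placement):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char node[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank assigned at launch */
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(node, &len);    /* node this copy landed on */

        printf("rank %d of %d runs on %s\n", rank, size, node);

        MPI_Finalize();
        return 0;
    }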

Message routing is usually not performed by MPI itself but is left to the underlying communication network. MPI only takes care of constructing the messages and handing them over to the network, which delivers them to their destination. Shared memory is usually used for communication between processes that reside on the same node.
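
One way to make this visible from user code (a sketch, under the assumption that the launcher lets you place ranks 0 and 1 either on the same node or on different nodes) is to time a short ping-pong: with both ranks on one node the round trip typically goes through shared memory and is noticeably faster than across the interconnect.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int reps = 1000;
        int rank, i;
        char byte = 0;
        double t;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        t = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {        /* bounce one byte to rank 1 and back */
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t = MPI_Wtime() - t;

        if (rank == 0)
            printf("avg round trip: %g us\n", t / reps * 1e6);

        MPI_Finalize();
        return 0;
    }

The MPI code is identical in both placements; only the transport chosen underneath differs.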

If you are interested in the technical details, I would recommend reading the Open MPI source code, which you can find on the project's website.

