Any advantage of MPI + CUDA over pure MPI?

A common way to speed up an application is to parallelize it with MPI, or with a higher-level library such as PETSc that uses MPI under the hood.

Lately, however, everyone seems interested in using CUDA to parallelize their application, or in using a hybrid of MPI and CUDA to tackle ever larger problems.

Is there a noticeable advantage to the hybrid MPI + CUDA programming model over the traditional, tried-and-tested MPI model? I am asking this specifically in the context of particle methods.

One reason I ask is that everywhere on the Internet I see the claim that "particle methods map naturally onto the architecture of GPUs," or some variation of it. But no one ever justifies why CUDA would serve me better than plain MPI for the same job.

2 answers

This is somewhat apples and oranges.

MPI and CUDA are fundamentally different. Most importantly, MPI lets you distribute your application across multiple nodes, while CUDA lets you use the GPU within a single node. If the parallel processes in an MPI program are taking too long, then yes, you should look at whether their work can be accelerated on the GPU instead of the CPU. Conversely, if your CUDA application is still too slow, you can spread the work over multiple nodes with MPI.

The two technologies are essentially orthogonal (assuming every node in your cluster is CUDA-capable).
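To make the "orthogonal" point concrete, here is a minimal sketch of the hybrid model (mine, not part of the original answer): each MPI rank drives one local GPU, so CUDA handles the fine-grained parallelism within a node while MPI handles coordination between nodes. The kernel and variable names are illustrative.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float *x, int n, float a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;              // data-parallel work on the GPU
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        cudaSetDevice(rank % ndev);        // map this rank to a local GPU

        const int n = 1 << 20;             // this rank's share of the problem
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemset(d_x, 0, n * sizeof(float));

        scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
        cudaDeviceSynchronize();           // CUDA: intra-node parallelism...

        MPI_Barrier(MPI_COMM_WORLD);       // ...MPI: inter-node coordination
        if (rank == 0)
            printf("ran on %d rank(s), %d visible device(s)\n", size, ndev);

        cudaFree(d_x);
        MPI_Finalize();
        return 0;
    }

You would typically compile this with nvcc, using your MPI compiler wrapper as the host compiler, and launch one rank per GPU with mpirun.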


Just to build on the other poster's fine answer, here is some high-level discussion of which problems GPUs handle well, and why.

GPUs have evolved along a different path than CPUs because of their different origins. Compared to CPU cores, GPU cores devote more silicon to ALUs and floating-point hardware, and less to control logic and cache. This means GPUs can deliver much greater efficiency for raw computation, but only code with regular control flow and well-behaved memory access patterns reaps the full benefit: upwards of a TFLOPS for single-precision floating-point code.

GPUs are throughput-oriented devices that tolerate high latency in both control and memory. Global memory sits behind a long, wide bus, so coalesced (contiguous and aligned) memory accesses achieve good bandwidth despite the high latency. That latency is hidden by demanding massive thread parallelism and by providing essentially zero-cost context switching in hardware.

GPUs use a SIMD-like SIMT model, in which groups of threads execute in SIMD lockstep (different groups may diverge freely) without forcing the programmer to account for this fact, except when chasing peak performance: on Fermi, divergence can cost up to a factor of 32. SIMT lends itself to a data-parallel programming model, in which data independence is exploited to apply similar processing to large arrays of data. Attempts are being made to generalize GPUs and their programming model, and to make it easier to program them for good performance.
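To illustrate why particle methods are said to fit this model, here is a sketch (mine, not the answerer's) of a one-thread-per-particle update. Control flow is uniform across threads, and a structure-of-arrays layout means consecutive threads read consecutive addresses, giving exactly the coalesced accesses described above. All names are illustrative.

    #include <cuda_runtime.h>

    // Structure-of-arrays particle update: thread i owns particle i.
    __global__ void advance(const float *vx, const float *vy,
                            float *px, float *py, int n, float dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {                // same branch for nearly every thread,
            px[i] += vx[i] * dt;    // so warps stay in lockstep, and
            py[i] += vy[i] * dt;    // consecutive threads touch consecutive
        }                           // floats (coalesced access)
    }

By contrast, an array-of-structures layout (one struct per particle holding px, py, vx, vy) would scatter each thread's loads across memory and hurt coalescing, which is one concrete way the "natural fit" claim cashes out.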
