Understanding block and block-cyclic matrix distributions

When working with parallel matrix decompositions, I am familiar with the block distribution, where we have (say) 4 processes, each with its own matrix subregion:

Block Matrix Decomposition

So, for example, here the number of processes along the rows ( procrows ) is 2, the number of processes along the columns ( proccols ) is also 2, and each A_local submatrix has size N/2 x M/2 if the original matrix has size N x M.

I am reading this example, which uses a "block-cyclic" distribution, and in this part:

    /* Begin Cblacs context */
    /* We assume that we have 4 processes and place them in a 2-by-2 grid */
    int ctxt, myid, myrow, mycol, numproc;
    int procrows = 2, proccols = 2;
    Cblacs_pinfo(&myid, &numproc);
    Cblacs_get(0, 0, &ctxt);
    Cblacs_gridinit(&ctxt, "Row-major", procrows, proccols);

they have procrows and proccols hardcoded, fine, but then there is a comment describing the matrix that reads:

Nb and Mb will be the number of rows and columns of blocks [matrices]

I do not understand this; aren't Nb and Mb completely determined by N, M, procrows, and proccols?


EDIT

From running the example, I see that the submatrix of process 0 holds all the elements of the upper-left corner of the matrix, as in my picture above, which contradicts Jonathan's answer. However, it works fine with ScaLAPACK's Cholesky.

1 answer

Block decompositions of matrices, as you describe in your question, are a perfectly valid way of distributing a matrix, but they are not the only way to do it.

In particular, the pure block data distribution (partitioning the matrix into procrows x proccols submatrices) is a little inflexible. If the matrix size is not divisible by the number of processes in a row or column (and, as a rule, you have no control over the matrix size, and only a little flexibility with procrows / proccols ), you can run into serious load-balancing problems. In addition, it is sometimes very convenient to "over-decompose" a problem: to break it into more pieces than you have tasks. In particular for MPI, since each task is a process, it is sometimes useful to give each process several submatrices to work on, so that you can tackle this additional level of parallelism with threading (which is built into most single-node linear algebra libraries).

The way to get maximum flexibility for load balancing, and the maximum degree of inter-process parallelism available, is a purely cyclic distribution. In a 1d cyclic distribution of, say, 15 elements across 4 processors, processor 1 gets element 1, processor 2 gets element 2, processor 3 gets element 3, processor 4 gets element 4, and then processor 1 gets element 5, and so on; you deal out elements round-robin across all the processors.

In a 1d block decomposition, on the other hand, processor 1 gets elements 1-4, processor 2 gets elements 5-8, and so on.

Below is a diagram from a useful LLNL parallel computing tutorial, with each data region color-coded by the processor that received it:

[diagram: block and cyclic data distributions, color-coded by processor]

Thus, cyclic decomposition is best for parallelism and load balancing, but terrible for data access; every adjacent piece of data you need in order to perform a linear algebra operation is off-processor. Block decomposition, on the other hand, is best for data access; you have the largest possible contiguous chunk of data, so you can operate on nice big submatrices; but it is inflexible for parallelism and can cost you in terms of load balancing.

Block-cyclic is an interpolation between the two; you decompose the matrix into blocks and distribute those blocks cyclically across processes. This lets you tune the tradeoff between data-access contiguity and flexibility. If the block-cyclic block sizes are 1, you have a cyclic distribution; if they are N/procrows or N/proccols , you have a block distribution; but you can also have anything in between.

Note that in 2D you can, in principle, choose different decompositions along the rows and the columns, and occasionally that is useful if your matrix is only going to be used in one kind of computation; but the more usual case is that the decomposition is the same across all dimensions, so that when people say "block decomposition" or "cyclic decomposition", they mean it along all dimensions.

A good description of this can be found on the ScaLAPACK pages.
