Block decompositions of matrices, as you describe in your question, are a perfectly valid way of distributing a matrix, but they are not the only way to do it.
In particular, block distributions of the data (breaking the matrix into procrows x proccols submatrices) are somewhat inflexible. If the matrix size is not divisible by the number of processes in a row or column - and generally you have no control over the matrix size, and only some flexibility with procrows/proccols - you can run into serious load-balancing problems. Also, it is sometimes very handy to "over-decompose" a problem: to break it into more pieces than you have tasks. In particular, for MPI, since each task is a process, it is sometimes useful to be able to have multiple submatrices for each process to operate on, so that you can tackle this additional level of parallelism with threading (which is built into most single-process linear algebra libraries).
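As a rough illustration of the load-balance problem (a minimal sketch, not taken from any particular library; it assumes 0-based indexing and ceiling-sized blocks, which is one common convention):

```c
#include <stdio.h>

/* Sketch: how many rows each process owns under a 1d block
 * distribution of nrows rows over nprocs processes, with blocks of
 * size ceil(nrows/nprocs).  With nrows = 10 and nprocs = 4 the counts
 * come out 3, 3, 3, 1 -- the last process has a third of the work of
 * the others. */
int main(void) {
    int nrows = 10, nprocs = 4;
    int blocksize = (nrows + nprocs - 1) / nprocs;   /* ceil(nrows/nprocs) */
    for (int p = 0; p < nprocs; p++) {
        int start = p * blocksize;
        int end   = (p + 1) * blocksize;
        if (end > nrows) end = nrows;
        int count = (end > start) ? end - start : 0;
        printf("process %d owns %d rows\n", p, count);
    }
    return 0;
}
```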
The way to get maximum flexibility for load balancing, and the maximum available degree of inter-process parallelism, is a purely cyclic distribution. In a 1d cyclic distribution, say dividing 15 elements among 4 processors, processor 1 gets element 1, processor 2 gets element 2, processor 3 gets element 3, processor 4 gets element 4, and then processor 1 gets element 5, and so on; you round-robin the elements across the processors.
With a 1d block decomposition, on the other hand, processor 1 gets elements 1-4, processor 2 gets elements 5-8, and so on.
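To make the two mappings concrete, here is a small sketch (plain C, using 0-based indices rather than the 1-based numbering in the prose above) that prints which process owns each of the 15 elements under each distribution:

```c
#include <stdio.h>

int main(void) {
    int n = 15, nprocs = 4;
    int blocksize = (n + nprocs - 1) / nprocs;   /* ceil(15/4) = 4 */

    printf("element : cyclic owner  block owner\n");
    for (int i = 0; i < n; i++) {
        int cyclic_owner = i % nprocs;      /* deal elements out round-robin */
        int block_owner  = i / blocksize;   /* contiguous chunks of 4 */
        printf("%7d : %12d  %11d\n", i, cyclic_owner, block_owner);
    }
    return 0;
}
```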
The following diagram is from a useful LLNL parallel computing tutorial, with each color coding which processor each region of data went to:

So cyclic decomposition is best for parallelism and load balancing, but terrible for data access; every neighboring piece of data you would need for performing linear algebra operations is off-processor. Block decomposition, on the other hand, is best for data access; you have the largest possible contiguous chunk of data, so you can operate on nice big submatrices; but it is inflexible for parallelism and can cost you in load balancing.
Block-cyclic distributions interpolate between the two; you decompose the matrix into blocks and distribute those blocks cyclically across processes. This lets you tune the tradeoff between contiguity of data access and flexibility. If the block-cyclic block size is 1, you have a cyclic distribution; if it is N/procrows or N/proccols, you have a block distribution; but you can also have anything in between.
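A sketch of the 1d block-cyclic owner computation (again with 0-based indices; the tunable parameter is the block size, and the two endpoints recover the distributions above):

```c
#include <stdio.h>

/* 1d block-cyclic: group the elements into blocks of size `blocksize`,
 * then deal the blocks out round-robin across the processes.
 *   blocksize = 1              -> purely cyclic distribution
 *   blocksize = ceil(n/nprocs) -> pure block distribution */
int owner(int i, int blocksize, int nprocs) {
    return (i / blocksize) % nprocs;
}

int main(void) {
    int n = 15, nprocs = 4;
    int sizes[] = {1, 2, 4};                 /* 4 = ceil(15/4): pure block */
    for (int s = 0; s < 3; s++) {
        printf("blocksize %d:", sizes[s]);
        for (int i = 0; i < n; i++)
            printf(" %d", owner(i, sizes[s], nprocs));
        printf("\n");
    }
    return 0;
}
```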
Note that in 2d you can, in principle, choose different decompositions along the rows and along the columns, and occasionally that is useful if your matrix will only ever be used in one kind of computation; but the more common case is that the decomposition is the same in all dimensions, so that when people say "block decomposition" or "cyclic decomposition" they usually mean it across all dimensions.
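In 2d the same arithmetic simply applies independently along each dimension; here is a sketch (a hypothetical helper, not any library's API, again 0-based) that maps a matrix entry (i, j) to its owning process coordinates on a procrows x proccols grid, with separate block sizes per dimension:

```c
#include <stdio.h>

/* 2d block-cyclic owner of entry (i, j): the row and column
 * distributions are independent, so they may use different block
 * sizes (rb, cb) per dimension. */
void owner2d(int i, int j, int rb, int cb,
             int procrows, int proccols, int *prow, int *pcol) {
    *prow = (i / rb) % procrows;
    *pcol = (j / cb) % proccols;
}

int main(void) {
    int procrows = 2, proccols = 2, rb = 2, cb = 2;
    for (int i = 0; i < 6; i++) {
        for (int j = 0; j < 6; j++) {
            int pr, pc;
            owner2d(i, j, rb, cb, procrows, proccols, &pr, &pc);
            printf(" (%d,%d)", pr, pc);   /* process grid coordinates */
        }
        printf("\n");
    }
    return 0;
}
```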
A good description of all this can be found in the ScaLAPACK documentation.