This is actually relatively straightforward (if you already know something about BLACS/ScaLAPACK and MPI-IO!), but even then the documentation - even online - is, as you've found, somewhat poor.
The first thing to know about MPI-IO is that it lets you use ordinary MPI data types to specify a "view" of the file, and then read only the data that falls into that view. At our centre we have slides and source code for a half-day course on parallel I/O; the first third or so covers the basics of MPI-IO. There are slides here and sample source code here.
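As a minimal illustration of the view mechanism (a sketch of my own, not part of the course material; the file name data.bin and the per-rank chunk size are made up), each rank below sets a view that exposes only its own contiguous slice of a file of reals, then reads just that slice collectively:

! Minimal sketch: each rank builds a file view from an MPI datatype and then
! reads only the bytes that view selects. Here every rank reads its own
! contiguous chunk of 'data.bin' (assumed to hold nlocal*nprocs default reals);
! the darray example below uses exactly the same mechanism with a fancier type.
program viewdemo
    use mpi
    implicit none
    integer, parameter :: nlocal = 100
    real, dimension(nlocal) :: buf
    integer :: fh, rank, ierr
    integer(kind=MPI_OFFSET_KIND) :: disp
    integer, dimension(MPI_STATUS_SIZE) :: mpistatus

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    call MPI_File_open(MPI_COMM_WORLD, 'data.bin', MPI_MODE_RDONLY, &
                       MPI_INFO_NULL, fh, ierr)

    ! The view: skip this rank's offset (4 bytes per default real assumed),
    ! then see the rest of the file as a stream of reals.
    disp = int(rank, MPI_OFFSET_KIND) * nlocal * 4
    call MPI_File_set_view(fh, disp, MPI_REAL, MPI_REAL, "native", &
                           MPI_INFO_NULL, ierr)

    ! Collective read: each rank gets only the data its view exposes.
    call MPI_File_read_all(fh, buf, nlocal, MPI_REAL, mpistatus, ierr)

    call MPI_File_close(fh, ierr)
    call MPI_Finalize(ierr)
end program viewdemo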
The second thing to know is that MPI has a built-in way to create "distributed array" data types, one combination of which lets you lay out a block-cyclic data distribution; that's discussed in general terms in my answer to this question: What is the difference between darray and subarray in mpi?
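To make that concrete, here is a minimal, self-contained sketch (mine, not from the linked answer; the 12x12 global size, 2x2 blocks, and the choice of MPI_REAL are just illustrative) that builds a block-cyclic darray type and asks each rank how many elements it will own:

! Minimal sketch: build a block-cyclic 'darray' type for a 12x12 real matrix
! on a process grid chosen by MPI_Dims_create. MPI_Type_size then tells each
! rank how many local elements fall to it under that distribution.
program darraydemo
    use mpi
    implicit none
    integer :: procs, mpirank, ierr, darray, locsize
    integer, dimension(2) :: pdims, dims, distribs, dargs

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, procs, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, mpirank, ierr)

    pdims = 0
    call MPI_Dims_create(procs, 2, pdims, ierr)   ! e.g. a 2x2 grid on 4 ranks

    dims     = [12, 12]                                        ! global matrix size
    distribs = [MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC]  ! block-cyclic in both dims
    dargs    = [2, 2]                                          ! 2x2 blocks

    call MPI_Type_create_darray(procs, mpirank, 2, dims, distribs, dargs, &
                                pdims, MPI_ORDER_FORTRAN, MPI_REAL, darray, ierr)
    call MPI_Type_commit(darray, ierr)

    call MPI_Type_size(darray, locsize, ierr)
    print *, 'rank ', mpirank, ' owns ', locsize/4, ' elements'   ! 4 bytes per real

    call MPI_Type_free(darray, ierr)
    call MPI_Finalize(ierr)
end program darraydemo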
So this means that if you have a binary file containing a big matrix, you can read it in with MPI-IO using MPI_Type_create_darray, and it will be distributed amongst tasks in a block-cyclic way. Then it's just a matter of making the BLACS or ScaLAPACK calls. An example program that uses the ScaLAPACK routine psgemv to do a matrix-vector multiplication (rather than psgemm) is listed in my answer to a question on the Computational Science Stack Exchange.
To give you an idea of how the pieces fit together, the following is a simple program which reads in a binary file containing a matrix (first the size of the square matrix N, then the N^2 elements) and then calculates the eigenvalues and eigenvectors using the (new) ScaLAPACK routine pssyevr. It combines the MPI-IO, darray, and ScaLAPACK parts. It's in Fortran, but the function calls are the same in C-based languages.
!
! Use MPI-IO to read a diagonal matrix distributed block-cyclically,
! use Scalapack to calculate its eigenvalues, and compare
! to expected results.
!
program darray
      use mpi
      implicit none

      integer :: n, nb    ! problem size and block size
      integer :: myArows, myAcols   ! size of local subset of global array
      real :: p
      real, dimension(:), allocatable :: myA, myZ
      real, dimension(:), allocatable :: work
      integer, dimension(:), allocatable :: iwork
      real, dimension(:), allocatable :: eigenvals
      real, dimension(:), allocatable :: expected
      integer :: worksize, totwork, iworksize

      integer, external :: numroc   ! blacs routine
      integer :: me, procs, icontxt, prow, pcol, myrow, mycol  ! blacs data
      integer :: info    ! scalapack return value
      integer, dimension(9)   :: ides_a, ides_z ! scalapack array desc
      integer :: clock
      real :: calctime, iotime

      character(len=128) :: filename
      integer :: mpirank
      integer :: ierr
      integer, dimension(2) :: pdims, dims, distribs, dargs
      integer :: infile
      integer, dimension(MPI_STATUS_SIZE) :: mpistatus
      integer :: darray
      integer :: locsize, nelements
      integer(kind=MPI_ADDRESS_KIND) :: lb, locextent
      integer(kind=MPI_OFFSET_KIND) :: disp
      integer :: nargs
      integer :: m, nz

! Initialize MPI (for MPI-IO)
      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD,procs,ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD,mpirank,ierr)

! May as well get the process grid from MPI_Dims_create
      pdims = 0
      call MPI_Dims_create(procs, 2, pdims, ierr)
      prow = pdims(1)
      pcol = pdims(2)

! get filename
      nargs = command_argument_count()
      if (nargs /= 1) then
          print *,'Usage: darray filename'
          print *,'       Where filename = name of binary matrix file.'
          call MPI_Abort(MPI_COMM_WORLD,1,ierr)
      endif
      call get_command_argument(1, filename)

! find the size of the array we'll be using
      call tick(clock)
      call MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_RDONLY, MPI_INFO_NULL, infile, ierr)
      call MPI_File_read_all(infile, n, 1, MPI_INTEGER, mpistatus, ierr)
      call MPI_File_close(infile, ierr)

! create the darray that will read in the data.
      nb = 64
      if (nb > (N/prow)) nb = N/prow
      if (nb > (N/pcol)) nb = N/pcol

      dims = [n,n]
      distribs = [MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC]
      dargs = [nb, nb]

      call MPI_Type_create_darray(procs, mpirank, 2, dims, distribs, dargs, &
                                  pdims, MPI_ORDER_FORTRAN, MPI_REAL, darray, ierr)
      call MPI_Type_commit(darray, ierr)

      call MPI_Type_size(darray, locsize, ierr)
      nelements = locsize/4
      call MPI_Type_get_extent(darray, lb, locextent, ierr)

! Initialize local arrays
      allocate(myA(nelements))
      allocate(myZ(nelements))
      allocate(eigenvals(n), expected(n))

! read in the data
      call MPI_File_open(MPI_COMM_WORLD, trim(filename), MPI_MODE_RDONLY, MPI_INFO_NULL, infile, ierr)
      disp = 4   ! skip N = 4 bytes
      call MPI_File_set_view(infile, disp, MPI_REAL, darray, "native", MPI_INFO_NULL, ierr)
      call MPI_File_read_all(infile, myA, nelements, MPI_REAL, mpistatus, ierr)
      call MPI_File_close(infile, ierr)

      iotime = tock(clock)
      if (mpirank == 0) then
          print *,'I/O time = ', iotime
      endif

! Initialize blacs processor grid
      call tick(clock)
      call blacs_pinfo   (me, procs)
      call blacs_get     (-1, 0, icontxt)
      call blacs_gridinit(icontxt, 'R', prow, pcol)
      call blacs_gridinfo(icontxt, prow, pcol, myrow, mycol)

      myArows = numroc(n, nb, myrow, 0, prow)
      myAcols = numroc(n, nb, mycol, 0, pcol)

! Construct local arrays
! Global structure:  matrix A of n rows and n columns

! Prepare array descriptors for ScaLAPACK
      call descinit( ides_a, n, n, nb, nb, 0, 0, icontxt, myArows, info )
      call descinit( ides_z, n, n, nb, nb, 0, 0, icontxt, myArows, info )
! Call ScaLAPACK library routine: first a workspace query, then the real call
      allocate(work(1), iwork(1))
      iwork(1) = -1
      work(1)  = -1.
      call pssyevr( 'V', 'A', 'U', n, myA, 1, 1, ides_a, -1.e20, +1.e20, 1, n, &
                    m, nz, eigenvals, myZ, 1, 1, ides_z, work, -1,             &
                    iwork, -1, info )
      worksize  = int(work(1))/2*3
      iworksize = iwork(1)/2*3
      print *, 'Local workspace ', worksize
      call MPI_Reduce(worksize, totwork, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (mpirank == 0) print *, ' total work space ', totwork
      call MPI_Reduce(iworksize, totwork, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (mpirank == 0) print *, ' total iwork space ', totwork
      deallocate(work, iwork)

      allocate(work(worksize), iwork(iworksize))
      call pssyevr( 'V', 'A', 'U', n, myA, 1, 1, ides_a, -1.e20, +1.e20, 1, n, &
                    m, nz, eigenvals, myZ, 1, 1, ides_z, work, worksize,       &
                    iwork, iworksize, info )
      if (info /= 0) then
         print *, 'Error: info = ', info
      else if (mpirank == 0) then
         print *, 'Calculated ', m, ' eigenvalues and ', nz, ' eigenvectors.'
      endif

! Deallocate the local arrays
      deallocate(myA, myZ)
      deallocate(work, iwork)

! End blacs for processors that are used
      call blacs_gridexit(icontxt)
      calctime = tock(clock)

! calculate the expected eigenvalues for a particular test matrix
      p = 3. + sqrt((4. * n - 3.) * (n - 1.)*3./(n+1.))
      expected(1) = p/(n*(5.-2.*n))
      expected(2) = 6./(p*(n + 1.))
      expected(3:n) = 1.

! Print results
      if (me == 0) then
        if (info /= 0) then
             print *, 'Error: info = ', info
        endif
        print *,'Eigenvalues L_infty err = ', maxval(abs(eigenvals - expected))
        print *,'Compute time = ', calctime
      endif

      deallocate(eigenvals, expected)

      call MPI_Finalize(ierr)

contains

! start a timer
      subroutine tick(t)
          integer, intent(OUT) :: t
          call system_clock(t)
      end subroutine tick

! return time in seconds since the corresponding tick()
      real function tock(t)
          integer, intent(in)  :: t
          integer :: now, clock_rate
          call system_clock(now, clock_rate)
          tock = real(now - t)/real(clock_rate)
      end function tock

end program darray
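One practical note: the "expected" eigenvalues at the end apply only to the particular test matrix from the original question, which isn't shown here. If you just want to exercise the program, any symmetric matrix will do; the input file simply has to contain N as a 4-byte integer followed by the N*N default reals in column-major order, with no Fortran record markers. A rough serial writer might look like the sketch below (the file name, size, and identity-matrix contents are all placeholders of mine):

! Sketch of a serial writer for the input format the program above expects:
! a 4-byte integer N, then N*N default reals, column-major, written as raw
! bytes (hence access='stream'). The identity matrix here is just a
! placeholder; the expected-eigenvalue check in the main program only holds
! for the specific test matrix from the original question.
program writemat
    implicit none
    integer, parameter :: n = 500
    real, dimension(n,n) :: a
    integer :: i, u

    a = 0.
    do i = 1, n
        a(i,i) = 1.
    end do

    open(newunit=u, file='matrix.dat', access='stream', form='unformatted', &
         status='replace')
    write(u) n        ! 4-byte matrix size
    write(u) a        ! n*n reals, column-major
    close(u)
end program writemat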