There is no single formal algorithm for this; in general, linear algebra operations in which the whole problem is not held in memory at the same time are called "out of core."
To solve this problem you do not need a particularly complex algorithm, just the CUBLAS library plus pencil and paper. For example, you can decompose a matrix product by partitioning A into two row blocks and B into two column blocks:

    A = [A1]        B = [B1  B2]
        [A2]

    A*B = [A1*B1  A1*B2]
          [A2*B1  A2*B2]
which gives you four independent submatrix multiplications. They can be computed with four calls to CUBLAS gemm from very simple host code. You can extend the idea to as many submatrices as needed to match the size of the problem and the memory capacity of your GPU. The same principle can also be used to implement matrix multiplication across multiple GPUs (see this question for an example).
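As a minimal sketch of the decomposition above, here is the same four-product scheme in Python/NumPy, with np.matmul standing in for cublas gemm (the function name blocked_matmul and the block sizes are illustrative assumptions; a real out-of-core version would copy each submatrix pair to the GPU, run one gemm call, and copy the tile back):

```python
import numpy as np

def blocked_matmul(A, B):
    """Multiply A @ B via four independent submatrix products.

    Each '@' below corresponds to one CUBLAS gemm call in the
    out-of-core version; only one (A_i, B_j) pair and its result
    tile would need to reside on the GPU at a time.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    # Partition A into two row blocks and B into two column blocks.
    A1, A2 = A[: m // 2, :], A[m // 2 :, :]
    B1, B2 = B[:, : n // 2], B[:, n // 2 :]
    # Four independent submatrix products.
    C11 = A1 @ B1
    C12 = A1 @ B2
    C21 = A2 @ B1
    C22 = A2 @ B2
    # Assemble the result from the four tiles.
    return np.block([[C11, C12], [C21, C22]])

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
B = rng.standard_normal((4, 8))
assert np.allclose(blocked_matmul(A, B), A @ B)
```

Because the four products are independent, they can be issued on separate CUDA streams (or separate GPUs) and overlapped with the host-device transfers.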
Alternatively, you can find working implementations of exactly this idea in the Harvard-developed SciGPU-GEMM codebase and in the HPL-CUDA linpack implementation (disclaimer: I am affiliated with the latter codebase).