Large matrix multiplication on the GPU

I need to implement matrix multiplication on the GPU with CUDA for large matrices, where each matrix is larger than the GPU memory. So I think I need an algorithm to do this efficiently. I searched the internet but could not find one. Can anyone give me the name of, or a link to, such an algorithm?

thanks

1 answer

There is no special named algorithm for this; in general, linear algebra operations of this kind, in which the whole problem does not fit in memory at the same time, are called "out of core".

To solve this problem you do not need a particularly complex algorithm, just the CUBLAS library and some pencil and paper. For example, you can decompose the matrix product as follows:

A · B = [A1; A2] · [B1  B2] = [A1·B1  A1·B2; A2·B1  A2·B2]

(A split into two row blocks A1, A2 and B split into two column blocks B1, B2)

which gives you four independent submatrix multiplication operations. They can be computed with four calls to CUBLAS gemm and very simple host code, as sketched below. You can extend the idea to as many submatrices as needed to match the size of the problem and the memory capacity of your GPU. The same principle can also be used to implement matrix multiplication across multiple GPUs (see this question for an example).
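As an illustration (not part of the original answer), here is a minimal host-side sketch of that blocking scheme using cublasSgemm. It assumes single precision, column-major storage (the cuBLAS convention), dimensions divisible by the chosen block sizes, and that one block of A, one block of B and one block of C fit in GPU memory together; error checking and overlapping transfers with compute via streams are omitted.

```cpp
// Out-of-core GEMM sketch: C (M x N) = A (M x K) * B (K x N), all column-major on the host.
// rowsPerBlock / colsPerBlock choose how A and B are partitioned (assumed to divide M and N).
#include <cublas_v2.h>
#include <cuda_runtime.h>

void outOfCoreGemm(const float* A, const float* B, float* C,
                   int M, int N, int K, int rowsPerBlock, int colsPerBlock)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * rowsPerBlock * K);
    cudaMalloc(&dB, sizeof(float) * K * colsPerBlock);
    cudaMalloc(&dC, sizeof(float) * rowsPerBlock * colsPerBlock);

    const float alpha = 1.0f, beta = 0.0f;

    for (int i = 0; i < M; i += rowsPerBlock) {            // row block of A / C
        // A is column-major with leading dimension M, so the row block
        // (rowsPerBlock x K) is strided; gather it with cudaMemcpy2D.
        cudaMemcpy2D(dA, sizeof(float) * rowsPerBlock,
                     A + i, sizeof(float) * M,
                     sizeof(float) * rowsPerBlock, K,
                     cudaMemcpyHostToDevice);

        for (int j = 0; j < N; j += colsPerBlock) {        // column block of B / C
            // The B column block (K x colsPerBlock) is contiguous in column-major storage.
            cudaMemcpy(dB, B + (size_t)j * K,
                       sizeof(float) * K * colsPerBlock, cudaMemcpyHostToDevice);

            // One independent sub-product: C_ij = A_i * B_j
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        rowsPerBlock, colsPerBlock, K,
                        &alpha, dA, rowsPerBlock, dB, K,
                        &beta, dC, rowsPerBlock);

            // Scatter the C block back into the full column-major C (leading dimension M).
            cudaMemcpy2D(C + (size_t)j * M + i, sizeof(float) * M,
                         dC, sizeof(float) * rowsPerBlock,
                         sizeof(float) * rowsPerBlock, colsPerBlock,
                         cudaMemcpyDeviceToHost);
        }
    }

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}
```

In a real implementation you would typically reuse the A block across the inner loop (as done here), pin the host memory, and use two CUDA streams to overlap the block transfers with the gemm calls.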

Alternatively, you can find a working implementation of exactly this idea in the Harvard-developed SciGPU-GEMM codebase and in the HPL-CUDA linpack implementation (disclaimer: I am affiliated with the latter codebase).

