Matrix multiplication in CUDA

Let's say I want to multiply two matrices together, 50 by 50. I have two ways to arrange threads and blocks.

a) one thread to calculate each element of the result matrix. The loop inside each thread then multiplies one row with one column (a dot product).

b) one thread for each multiplication, so each element of the result matrix needs 50 threads. After the multiplications are done, I can use a binary reduction to sum the results.

I was not sure which way to go, so I chose (b). It was not great - actually it was slow. Any idea why? My guess is that there are too many threads and they spend most of their time waiting for resources - is that right?
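For concreteness, here is roughly how I arranged (b) - a simplified sketch, not my exact code (one block per output element, 64 threads per block so the reduction works on a power of two):

    #define N 50
    #define BLOCK 64   // next power of two >= N, keeps the tree reduction simple

    __global__ void matmul_b(const float *A, const float *B, float *C)
    {
        __shared__ float partial[BLOCK];

        int row = blockIdx.y;      // one block per output element C[row][col]
        int col = blockIdx.x;
        int k   = threadIdx.x;     // one thread per scalar multiplication

        // Each thread performs exactly one multiply; extra threads contribute 0.
        partial[k] = (k < N) ? A[row * N + k] * B[k * N + col] : 0.0f;
        __syncthreads();

        // Binary (tree) reduction; every step needs a synchronization.
        for (int s = BLOCK / 2; s > 0; s >>= 1) {
            if (k < s)
                partial[k] += partial[k + s];
            __syncthreads();
        }

        if (k == 0)
            C[row * N + col] = partial[0];
    }

    // launch: matmul_b<<<dim3(N, N), BLOCK>>>(dA, dB, dC);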

+3
4 answers

As with many things in high-performance computing, the key to understanding performance here is understanding memory use.

If you use one thread to do one multiplication, then that thread has to pull two pieces of data from memory, multiply them, and then take part in something like a logarithmic number of additions. That is three memory accesses for one multiply, one add, and a bit more - the arithmetic intensity is very low. The good news is that this way there are many, many threads' worth of work, each needing only a tiny amount of memory/registers, which is good for occupancy; but the ratio of work to memory access is poor.

The simple approach of one thread doing one whole dot product has the same sort of problem: each multiplication still requires two memory accesses to load its operands. The good news is that there is only one store to global memory for the entire dot product, and you avoid the binary reduction, which does not scale as well and requires a lot of synchronization; the downside is that there are now far fewer threads, which is the one thing your (b) approach had working for it.
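For comparison, a minimal sketch of that one-thread-per-dot-product kernel (illustrative only, row-major storage assumed):

    __global__ void matmul_a(const float *A, const float *B, float *C, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n || col >= n) return;

        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];   // two global loads per multiply-add

        C[row * n + col] = sum;   // a single global store for the whole dot product
    }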

Now you know that there has to be some way to do more operations per memory access than this; for square NxN matrices, there is N^3 work to do in the multiplication, but only 3xN^2 elements to move - so you should be able to find a way to do far more than one computation per two memory accesses. For your 50x50 case, that is 125,000 multiply-adds over only 7,500 matrix elements, or roughly 16 operations available per element loaded or stored.

The approach taken in the CUDA SDK is the best way - there the matrices are broken into tiles, and your (a) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling entire small sub-matrices from slow global memory into shared memory and doing the calculation from there, each number read from memory can be used in many multiplies and adds. This approach is the most successful one in lots of applications, because getting the data - whether over a network, from main memory for a CPU, or via off-chip access for a GPU - often takes much longer than processing it.
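A stripped-down sketch of that tiling idea (not the SDK code itself; it assumes the matrix size is a multiple of the tile size, so a 50x50 matrix would have to be padded):

    #define TILE 16

    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            // one global load per thread per tile, for each input matrix
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();

            // TILE multiply-adds for every pair of values loaded above
            for (int k = 0; k < TILE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }

        C[row * n + col] = sum;
    }

    // launch: matmul_tiled<<<dim3(n / TILE, n / TILE), dim3(TILE, TILE)>>>(dA, dB, dC, n);

Each thread still issues two global loads per tile, but now performs TILE multiply-adds with them, so the arithmetic intensity grows with the tile size.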

There are documents on NVIDIA's CUDA pages (esp. http://developer.nvidia.com/object/cuda_training.html ) that describe their SDK example very nicely.

+4

Have you looked at the CUDA documentation: CUDA Programming Model?

There is also example source code: Matrix Multiplication.

+3

Have you looked at

$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul 

i.e. the matrix multiplication example in the SDK?

+1

If you do not have to implement it yourself, just use a library - CUBLAS, MAGMA, etc. provide tuned implementations of matrix multiplication.
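For example, a cuBLAS SGEMM call looks roughly like this (handle creation, allocation and error checking omitted; the argument swap is the usual trick for row-major data):

    #include <cublas_v2.h>

    // dA, dB, dC: device pointers to n x n float matrices stored row-major.
    // cuBLAS is column-major, so computing C^T = B^T * A^T (i.e. swapping the
    // two input arguments) yields the row-major product C = A * B.
    void gemm_rowmajor(cublasHandle_t handle,
                       const float *dA, const float *dB, float *dC, int n)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n,
                    &alpha,
                    dB, n,     // cuBLAS "A" argument = our B
                    dA, n,     // cuBLAS "B" argument = our A
                    &beta,
                    dC, n);
    }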

0
