As with so many things in high-performance computing, the key to understanding performance here is understanding memory usage.
If you use one thread to do one multiplication, then that thread has to pull two pieces of data out of memory, multiply them, and then take part in some logarithmic number of additions to reduce the products to a sum. That's roughly three memory accesses for a multiply and an add and a bit — the arithmetic intensity is very low. The good news is that there are many, many threads' worth of tasks this way, each of which needs only a tiny amount of memory and registers, which is good for occupancy; but the ratio of work to memory access is poor.
A single thread doing a whole dot product per output element has the same problem: each multiplication still requires two memory loads. The good news is that there is only one store to global memory for the entire dot product, and you avoid the binary reduction, which doesn't scale as well and requires a lot of synchronization; the downside is that there are now far fewer threads to keep the GPU busy, which the approach above at least provided.
Now, you know there must be some way to do more operations per memory access than this; for N x N square matrices there are N^3 multiply-adds to do, but only 3 x N^2 elements of data — so you ought to be able to find a way to do more than one computation per two memory accesses.
The approach taken in the CUDA SDK is the best way — the matrices are broken into tiles, and your (b) approach, one thread per output element, is used; but the key is in how the threads are arranged. By pulling entire small sub-matrices from slow global memory into shared memory and doing the computation from there, you get to do many multiplies and adds for each number you read from memory. This approach is the most successful one in many applications, because getting the data — whether over a network, or from main memory for a CPU, or from off-chip memory for a GPU — often takes much longer than processing it.
There are documents on the NVIDIA CUDA pages (esp. http://developer.nvidia.com/object/cuda_training.html ) that describe their SDK example very well.
Jonathan Dursi