I am looking to work with roughly 4,000 matrices of fixed size (3x3, 4x4), doing things such as matrix inversion and eigendecomposition.
It seems to me that the best way to parallelize this would be to let each of the many GPU threads work on a single instance of the problem.
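To make that concrete, here is a minimal sketch of the kind of kernel I have in mind; the function name, the packed row-major layout, and the assumption of nonsingular inputs are my own choices for illustration. Each thread inverts one 3x3 matrix via the adjugate formula:

```
#include <cuda_runtime.h>

// One thread per matrix: matrix i occupies elements [9*i, 9*i+9) of the
// input/output arrays, stored row-major. Inputs are assumed nonsingular.
__global__ void invert3x3Batched(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float* A = in  + 9 * i;
    float*       B = out + 9 * i;

    // Cofactors C00, C01, C02: the first column of the adjugate,
    // which also yields the determinant via expansion along row 0.
    float c00 = A[4] * A[8] - A[5] * A[7];
    float c01 = A[5] * A[6] - A[3] * A[8];
    float c02 = A[3] * A[7] - A[4] * A[6];
    float det = A[0] * c00 + A[1] * c01 + A[2] * c02;
    float r   = 1.0f / det;

    // inv(A) = adj(A) / det(A)
    B[0] = c00 * r;
    B[1] = (A[2] * A[7] - A[1] * A[8]) * r;
    B[2] = (A[1] * A[5] - A[2] * A[4]) * r;
    B[3] = c01 * r;
    B[4] = (A[0] * A[8] - A[2] * A[6]) * r;
    B[5] = (A[2] * A[3] - A[0] * A[5]) * r;
    B[6] = c02 * r;
    B[7] = (A[1] * A[6] - A[0] * A[7]) * r;
    B[8] = (A[0] * A[4] - A[1] * A[3]) * r;
}

// Launch with one thread per matrix, e.g. for n = 4000:
//   invert3x3Batched<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```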
Is there a reasonable way to do this? I have read http://www.culatools.com/blog/2011/12/09/batched-operations/ , but as far as I can tell this is still something that is "being worked on", without an available solution. Three years later, I hope a good solution has appeared.
So far, I have looked at:
- Using Eigen in CUDA kernels: http://eigen.tuxfamily.org/dox-devel/TopicCUDA.html . However, this support is in its infancy: it does not work well and some things are not implemented. Moreover, I am not sure whether it is optimized for CUDA at all. There is hardly any documentation, and the only example code is a test file (cuda_basic.cu in the Eigen test suite). When I tried to use Eigen in CUDA kernels, simple things like declaring an Eigen::MatrixXf inside a kernel did not survive compilation with nvcc V7.0.27 and Eigen 3.2.90 (mercurial); a minimal reproduction is sketched after this list.
- Using the cuBLAS device API library to run BLAS routines from within a kernel. cuBLAS and its ilk seem to be written to parallelize within a single operation even for small matrices, which looks like overkill and is probably slow for the 3x3 and 4x4 matrices I am interested in. Also, I am not sure whether there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels.)
- Batching kernels using CUDA streams. This is suggested in Section 2.1.7, "Batching Kernels", of the cuBLAS documentation for CUDA Toolkit v7.0. However, "in practice it is not possible to have more than 16 concurrent kernels executing at the same time", so this would be terrible for processing 4,000 small matrices. The CULA blog post linked above says as much; I quote: "In theory, you could use a CUDA stream per problem and launch one problem at a time. This would perform poorly for two reasons. The first is that the number of threads per block would be far too low; [...] The second is that the overhead incurred by launching thousands of operations in this fashion would be unacceptable, because the launch code is as expensive (if not more expensive) than just performing the matrix on the CPU."
- Implementing my own matrix multiplication and eigendecomposition routines in kernels. This is likely to be very slow, and may in addition be time-consuming to implement.
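For reference, the failed Eigen experiment mentioned in the first bullet boiled down to something like the following (reconstructed from memory, so not verbatim):

```
#include <Eigen/Dense>

// Trivial kernel that does nothing but declare a dynamically-sized Eigen
// matrix. With nvcc V7.0.27 and Eigen 3.2.90 this did not survive
// compilation. Since MatrixXf allocates dynamically, a fixed-size type
// such as Eigen::Matrix3f would presumably be the more natural fit for
// 3x3 problems anyway, but the documentation gives little guidance.
__global__ void eigenInKernelTest()
{
    Eigen::MatrixXf m(3, 3);
    (void)m;
}
```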
At this point, I am tempted to give up on doing this on the GPU at all. It is a pity, since I was hoping for real-time performance from an algorithm that requires inverting 4,000 3x3 matrices about 100 times every 0.1 seconds (that is, roughly four million inversions per second).
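In case it helps anyone attempting this, the real-time budget is easy to check with CUDA events; here is a rough timing harness (it assumes the hypothetical invert3x3Batched kernel sketched earlier, and skips error checking for brevity):

```
#include <cstdio>
#include <cuda_runtime.h>

// Time 100 back-to-back batched inversions of 4,000 matrices and compare
// against the 0.1 s budget. Requires the invert3x3Batched kernel above.
// Inputs are left uninitialized, which is fine for timing purposes only.
int main()
{
    const int n = 4000;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  9 * n * sizeof(float));
    cudaMalloc(&d_out, 9 * n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int rep = 0; rep < 100; ++rep)
        invert3x3Batched<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("100 x %d inversions: %.3f ms (budget: 100 ms)\n", n, ms);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```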