I am looking to work with roughly 4,000 matrices of fixed size (3x3, 4x4), doing things such as matrix inversion and eigendecomposition.
It seems to me that the best way to parallelize this would be to let each of the many GPU threads work on a single instance of the problem.
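To make that concrete, here is a minimal sketch of the kind of kernel I have in mind; the function name, the packed row-major layout, and the assumption of nonsingular inputs are my own choices for illustration. Each thread inverts one 3x3 matrix via the adjugate formula:

```
#include <cuda_runtime.h>

// One thread per matrix: matrix i occupies elements [9*i, 9*i+9) of the
// input/output arrays, stored row-major. Inputs are assumed nonsingular.
__global__ void invert3x3Batched(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float* A = in  + 9 * i;
    float*       B = out + 9 * i;

    // Cofactors C00, C01, C02: the first column of the adjugate,
    // which also yields the determinant via expansion along row 0.
    float c00 = A[4] * A[8] - A[5] * A[7];
    float c01 = A[5] * A[6] - A[3] * A[8];
    float c02 = A[3] * A[7] - A[4] * A[6];
    float det = A[0] * c00 + A[1] * c01 + A[2] * c02;
    float r   = 1.0f / det;

    // inv(A) = adj(A) / det(A)
    B[0] = c00 * r;
    B[1] = (A[2] * A[7] - A[1] * A[8]) * r;
    B[2] = (A[1] * A[5] - A[2] * A[4]) * r;
    B[3] = c01 * r;
    B[4] = (A[0] * A[8] - A[2] * A[6]) * r;
    B[5] = (A[2] * A[3] - A[0] * A[5]) * r;
    B[6] = c02 * r;
    B[7] = (A[1] * A[6] - A[0] * A[7]) * r;
    B[8] = (A[0] * A[4] - A[1] * A[3]) * r;
}

// Launch with one thread per matrix, e.g. for n = 4000:
//   invert3x3Batched<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```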
Is there a reasonable way to do this? I have read http://www.culatools.com/blog/2011/12/09/batched-operations/ , but as far as I can tell this is still something that is "being worked on", without an available solution. Three years later, I hope a good solution has appeared.
So far, I have looked at:
- Using Eigen in CUDA kernels: http://eigen.tuxfamily.org/dox-devel/TopicCUDA.html . However, this support is in its infancy: it does not work well and some things are not implemented. Moreover, I am not sure whether it is optimized for CUDA at all. There is hardly any documentation, and the only example code is a test file (cuda_basic.cu in the Eigen test suite). When I tried to use Eigen in CUDA kernels, simple things like declaring an Eigen::MatrixXf inside a kernel did not survive compilation with nvcc V7.0.27 and Eigen 3.2.90 (mercurial); a minimal reproduction is sketched after this list.
- Using the cuBLAS device API library to run BLAS routines from within a kernel. cuBLAS and its ilk seem to be written to parallelize within a single operation even for small matrices, which looks like overkill and is probably slow for the 3x3 and 4x4 matrices I am interested in. Also, I am not sure whether there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels.)
- Batching kernels using CUDA streams. This is suggested in Section 2.1.7, "Batching Kernels", of the cuBLAS documentation for CUDA Toolkit v7.0. However, "in practice it is not possible to have more than 16 concurrent kernels executing at the same time", so this would be terrible for processing 4,000 small matrices. The CULA blog post linked above says as much; I quote: "In theory, you could use a CUDA stream per problem and launch one problem at a time. This would perform poorly for two reasons. The first is that the number of threads per block would be far too low; [...] The second is that the overhead incurred by launching thousands of operations in this fashion would be unacceptable, because the launch code is as expensive (if not more expensive) than just performing the matrix on the CPU."
- Implementing my own matrix multiplication and eigendecomposition routines in kernels. This is likely to be very slow, and may in addition be time-consuming to implement.
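For reference, the failed Eigen experiment mentioned in the first bullet boiled down to something like the following (reconstructed from memory, so not verbatim):

```
#include <Eigen/Dense>

// Trivial kernel that does nothing but declare a dynamically-sized Eigen
// matrix. With nvcc V7.0.27 and Eigen 3.2.90 this did not survive
// compilation. Since MatrixXf allocates dynamically, a fixed-size type
// such as Eigen::Matrix3f would presumably be the more natural fit for
// 3x3 problems anyway, but the documentation gives little guidance.
__global__ void eigenInKernelTest()
{
    Eigen::MatrixXf m(3, 3);
    (void)m;
}
```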
At this point, I am tempted to give up on doing this on the GPU at all. It is a pity, since I was hoping for real-time performance from an algorithm that requires inverting 4,000 3x3 matrices about 100 times every 0.1 seconds (that is, roughly four million inversions per second).
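In case it helps anyone attempting this, the real-time budget is easy to check with CUDA events; here is a rough timing harness (it assumes the hypothetical invert3x3Batched kernel sketched earlier, and skips error checking for brevity):

```
#include <cstdio>
#include <cuda_runtime.h>

// Time 100 back-to-back batched inversions of 4,000 matrices and compare
// against the 0.1 s budget. Requires the invert3x3Batched kernel above.
// Inputs are left uninitialized, which is fine for timing purposes only.
int main()
{
    const int n = 4000;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  9 * n * sizeof(float));
    cudaMalloc(&d_out, 9 * n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int rep = 0; rep < 100; ++rep)
        invert3x3Batched<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("100 x %d inversions: %.3f ms (budget: 100 ms)\n", n, ms);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```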