So, I think I've finally arrived at something that works at least reasonably well. I would still be very happy to award the bounty to anyone who has further improvements. In particular, improvements based on the design that I tried (and failed) to implement, as described above, would be good. But any improvements or suggestions on this subject would be welcome, and I'd gladly give the bounty for them.
The key breakthrough I discovered for parallelizing across multiple GPUs with CUSPARSE and CUBLAS is that you need to create a separate handle for each GPU. For example, from the cuBLAS API documentation:
The application must initialize the handle to the cuBLAS library context by calling the cublasCreate() function. Then, the handle is explicitly passed to every subsequent library function call. Once the application finishes using the library, it must call the function cublasDestroy() to release the resources associated with the cuBLAS library context.
This approach allows the user to explicitly control the library setup when using multiple host threads and multiple GPUs. For example, the application can use cudaSetDevice() to associate different devices with different host threads, and in each of those host threads it can initialize a unique handle to the cuBLAS library context, which will use the particular device associated with that host thread. Then, cuBLAS library function calls made with a different handle will automatically dispatch the computation to different devices.
See here and here for some additional helpful documents.
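The handle-per-device pattern described in that quote can be sketched in Julia roughly as follows. This is a hypothetical illustration rather than the code from my solution: it assumes CUDArt's device() for selecting the current GPU and calls cublasCreate_v2 / cublasDestroy_v2 from libcublas directly via ccall.

```julia
using CUDArt

# One cuBLAS handle per GPU: each handle is bound to whichever
# device was current when cublasCreate was called.
devs = [0, 1]            # device IDs; adjust to your hardware
handles = Ptr{Void}[]
for dev in devs
    device(dev)          # CUDArt: make this GPU the current device
    h = Ref{Ptr{Void}}(C_NULL)
    ccall((:cublasCreate_v2, "libcublas"), Cint, (Ptr{Ptr{Void}},), h)
    push!(handles, h[])
end

# ... pass handles[i] to cuBLAS calls intended for device devs[i] ...

for (dev, h) in zip(devs, handles)
    device(dev)
    ccall((:cublasDestroy_v2, "libcublas"), Cint, (Ptr{Void},), h)
end
```

Because each handle remembers the device that was current at creation time, calls made with different handles run on different GPUs without any further device switching inside the math routines themselves.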
Now, to actually make progress, I had to resort to a bunch of pretty dirty hacks. In the future, I hope to get in touch with the people who develop the CUSPARSE and CUBLAS packages about getting these changes incorporated properly. For the moment, here is what I did:
First, the CUSPARSE and CUBLAS packages both have functions for creating handles, but I had to modify the packages a bit to export these functions (along with the other functions and object types they need) so that I could access them myself.
In particular, I added the following to CUSPARSE.jl:
export libcusparse, SparseChar
the following to libcusparse_types.jl:
export cusparseHandle_t, cusparseOperation_t, cusparseMatDescr_t, cusparseStatus_t
the following to libcusparse.jl:
export cusparseCreate
and the following to sparse.jl:
export getDescr, cusparseop
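With those exports in place, a handle can be created by hand along these lines. This is a hypothetical sketch, not code from the package: it uses the exported libcusparse library name and assumes, per the cuSPARSE C API, that cusparseCreate takes a pointer to an opaque handle and returns a status code.

```julia
using CUSPARSE   # assumes the export modifications described above

h = Ref{Ptr{Void}}(C_NULL)
# cusparseCreate fills in an opaque handle bound to the current device
status = ccall((:cusparseCreate, libcusparse), Cint, (Ptr{Ptr{Void}},), h)
@assert status == 0   # 0 == CUSPARSE_STATUS_SUCCESS
```

Repeating this once per GPU (after a cudaSetDevice-style device switch) yields the per-device handles that the quoted documentation calls for.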
Through all of this, I was able to get functional access to the cusparseCreate() function, which can be used to create new handles (I couldn't just use CUSPARSE.cusparseCreate() because that function depended on many other functions and data types). From there, I defined the new version of the matrix multiplication operation that I wanted, taking an extra handle argument to feed into the ccall() to the CUDA driver. Below is the full code:
using CUDArt, CUSPARSE