Julia: CUSPARSE Parallel Computing on Multiple GPUs

I have n separate GPUs, each of which stores its own data. I would like to have each of them perform a set of calculations simultaneously. The CUDArt documentation here describes the use of threads to asynchronously invoke custom C kernels in order to achieve parallelization (see also this other example here). With custom kernels, this can be accomplished through the stream argument in CUDArt's implementation of the launch() function (a rough sketch of that pattern follows below). However, as far as I can tell, the CUSPARSE (and CUBLAS) functions do not have a similar option for stream specification.
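For reference, here is a condensed sketch of what that custom-kernel pattern looks like. It is not runnable as-is: "mymodule.ptx" and "mykernel" are hypothetical placeholders for a compiled custom kernel.

using CUDArt

devices(dev->true) do devlist
    @sync for dev in devlist
        @async begin
            device(dev)                            ## make dev the active device for this task
            md = CuModule("mymodule.ptx", false)   ## hypothetical compiled kernel module
            fn = CuFunction(md, "mykernel")        ## hypothetical kernel name
            d_a = CudaArray(rand(1024))            ## some device data for the kernel
            s = Stream()
            launch(fn, 4, 256, (d_a,); stream=s)   ## queue the kernel asynchronously on s
            wait(s)                                ## yields to other Julia tasks until s finishes
            unload(md)
        end
    end
end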

Is this possible with CUSPARSE, or would I just need to drop down to C if I want to use multiple GPUs?

EDIT: Bounty Update

So, I finally have a relatively decent solution that works. But I am sure it could be improved in a million ways - right now it is pretty hacked together. In particular, I would like suggestions for solutions along the lines of what I tried and wrote up in this SO question (which I never got working properly). Thus, I would be happy to award the bounty to anyone with additional ideas here.

2 answers

So, I think I have finally arrived at something that works at least relatively well. I would still be very happy to award the bounty to anyone who has further improvements. In particular, improvements based on the design that I attempted (but failed) to implement, as described in this SO question, would be great. But I would be glad to award the bounty for any improvements or suggestions on this.

The key breakthrough I found for getting CUSPARSE and CUBLAS to parallelize over multiple GPUs is that you need to create a separate handle for each GPU. E.g., from the documentation for the cuBLAS API:

The application must initialize the handle to the cuBLAS library context by calling the cublasCreate() function. Then, it is explicitly passed to every subsequent library function call. Once the application finishes using the library, it must call the function cublasDestroy() to release the resources associated with the cuBLAS library context.

This approach allows the user to explicitly control the library setup when using multiple host threads and multiple GPUs. For example, the application can use cudaSetDevice() to associate different devices with different host threads and in each of those host threads it can initialize a unique handle to the cuBLAS library context, which will use the particular device associated with that host thread. Then, cuBLAS library function calls made with different handles will automatically dispatch the computation to different devices.

(emphasis added)

See here and here for some additional helpful documents.
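In Julia terms, the pattern described in that quote boils down to the following sketch. It uses the modified CUSPARSE exports described below (cusparseHandle_t and cusparseCreate); the same idea would apply to CUBLAS with its own handle-creation function:

using CUDArt, CUSPARSE  ## modified version of CUSPARSE, as described below

devlist = devices(dev->true)
handles = Array(Array{Ptr{Void},1}, length(devlist))

for (idx, dev) in enumerate(devlist)
    device(dev)                        ## analogue of cudaSetDevice(): make dev current
    handles[idx] = cusparseHandle_t[0]
    cusparseCreate(handles[idx])       ## this handle is now tied to the current device
end

## Library calls made with handles[idx] will now execute on devlist[idx].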

Now, in order to actually get this working, I had to do a bunch of rather dirty hacking. In the future, I hope to get in touch with the folks who developed the CUSPARSE and CUBLAS packages to see about incorporating this into those packages. For the moment, here is what I did:

First, the CUSPARSE and CUBLAS packages do have functions to create handles. But I had to modify the packages a bit to export those functions (along with the other functions and object types that they need) so that I could access them myself.

In particular, I added the following to CUSPARSE.jl:

 export libcusparse, SparseChar 

to libcusparse_types.jl the following:

 export cusparseHandle_t, cusparseOperation_t, cusparseMatDescr_t, cusparseStatus_t 

to libcusparse.jl the following:

 export cusparseCreate 

and to sparse.jl the following:

 export getDescr, cusparseop 

Through all of this, I was able to get functional access to the cusparseCreate() function, which can be used to create new handles (I couldn't just use CUSPARSE.cusparseCreate() because that function depended on a bunch of other functions and data types). From there, I defined a new version of the matrix multiplication operation that I wanted, which takes an additional Handle argument to feed into the ccall() to the CUDA driver. Below is the full code:

using CUDArt, CUSPARSE  ## note: modified version of CUSPARSE, as indicated above.

N = 10^3;
M = 10^6;
p = 0.1;

devlist = devices(dev->true);
nGPU = length(devlist)

dev_X = Array(CudaSparseMatrixCSR, nGPU)
dev_b = Array(CudaArray, nGPU)
dev_c = Array(CudaArray, nGPU)
Handles = Array(Array{Ptr{Void},1}, nGPU)

for (idx, dev) in enumerate(devlist)
    println("sending data to device $dev")
    device(dev)  ## switch to given device
    dev_X[idx] = CudaSparseMatrixCSR(sprand(N,M,p))
    dev_b[idx] = CudaArray(rand(M))
    dev_c[idx] = CudaArray(zeros(N))
    Handles[idx] = cusparseHandle_t[0]
    cusparseCreate(Handles[idx])
end

function Pmv!(
    Handle::Array{Ptr{Void},1},
    transa::SparseChar,
    alpha::Float64,
    A::CudaSparseMatrixCSR{Float64},
    X::CudaVector{Float64},
    beta::Float64,
    Y::CudaVector{Float64},
    index::SparseChar)
    Mat = A
    cutransa = cusparseop(transa)
    m,n = Mat.dims
    cudesc = getDescr(A,index)
    device(device(A))  ## necessary to switch to the device associated with the handle and data for the ccall
    ccall(
        ((:cusparseDcsrmv),libcusparse), cusparseStatus_t,
        (cusparseHandle_t, cusparseOperation_t, Cint, Cint, Cint,
         Ptr{Float64}, Ptr{cusparseMatDescr_t}, Ptr{Float64}, Ptr{Cint},
         Ptr{Cint}, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}),
        Handle[1], cutransa, m, n, Mat.nnz, [alpha], &cudesc, Mat.nzVal,
        Mat.rowPtr, Mat.colVal, X, [beta], Y
    )
end

function test(Handles, dev_X, dev_b, dev_c, idx)
    Pmv!(Handles[idx], 'N', 1.0, dev_X[idx], dev_b[idx], 0.0, dev_c[idx], 'O')
    device(idx-1)
    return to_host(dev_c[idx])
end

function test2(Handles, dev_X, dev_b, dev_c)
    @sync begin
        for (idx, dev) in enumerate(devlist)
            @async begin
                Pmv!(Handles[idx], 'N', 1.0, dev_X[idx], dev_b[idx], 0.0, dev_c[idx], 'O')
            end
        end
    end
    Results = Array(Array{Float64}, nGPU)
    for (idx, dev) in enumerate(devlist)
        device(dev)
        Results[idx] = to_host(dev_c[idx])  ## to_host doesn't require setting correct device first. But, it is quicker if you do this.
    end
    return Results
end

## Function times given after initial run for compilation
@time a = test(Handles, dev_X, dev_b, dev_c, 1);
## 0.010849 seconds (12 allocations: 8.297 KB)

@time b = test2(Handles, dev_X, dev_b, dev_c);
## 0.011503 seconds (68 allocations: 19.641 KB)

# julia> a == b[1]
# true

One small improvement would be to wrap the ccall expression in a check function, so that you get output whenever a CUDA call returns an error.
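For instance, here is a minimal sketch of such a check. The assumption here is that cusparseStatus_t values compare against integers, with 0 corresponding to CUSPARSE_STATUS_SUCCESS in the CUSPARSE headers; the helper name statuscheck is just illustrative:

function statuscheck(status)
    ## 0 corresponds to CUSPARSE_STATUS_SUCCESS; anything else is an error code
    if status != 0
        error("CUSPARSE call failed with status code $status")
    end
    return status
end

## usage inside Pmv!: wrap the whole ccall, e.g.
## statuscheck(ccall( ((:cusparseDcsrmv),libcusparse), cusparseStatus_t, ... ))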

