I am trying to create a PTX module that wraps a cuBLAS function, in order to answer this currently unresolved SO question. I want to be able to define a function that can then be executed using launch() or some similar utility.
As a basic guide, I am using this page, which gives an example script for calling cuBLAS functions. I have also looked at this example from the CUDArt GitHub repository for more information. My current attempt looks something like this:
    #include <cublas_v2.h>

    extern "C"
    // Multiply the arrays A and B on GPU and save the result in C
    // C(m,n) = A(m,k) * B(k,n)
    __device__ void gpu_blas_mmul(const float *A, const float *B, float *C, const int m, const int k, const int n) {
        int lda=m,ldb=k,ldc=m;
        const float alf = 1;
        const float bet = 0;
        const float *alpha = &alf;
        const float *beta = &bet;

        // Create a handle for CUBLAS
        cublasHandle_t handle;
        cublasCreate(&handle);

        // Do the actual multiplication
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

        // Destroy the handle
        cublasDestroy(handle);
    }
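For reference, the result I expect from the cublasSgemm call can be sketched on the CPU. This is only a minimal sketch for checking output, not part of the PTX module: `cpu_mmul` is a hypothetical helper name of my own, and it assumes column-major storage (the layout cuBLAS uses) with the same leading dimensions as above, lda = m, ldb = k, ldc = m.

```cpp
#include <cassert>

// Hypothetical CPU reference, not part of the PTX module.
// Computes C(m,n) = A(m,k) * B(k,n) with column-major storage:
// element (row, col) of A lives at A[row + col * lda].
void cpu_mmul(const float *A, const float *B, float *C,
              const int m, const int k, const int n) {
    assert(A != nullptr && B != nullptr && C != nullptr);

    // Same leading dimensions as in the GPU version.
    const int lda = m, ldb = k, ldc = m;

    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            // Dot product of row `row` of A with column `col` of B.
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[row + p * lda] * B[p + col * ldb];
            }
            C[row + col * ldc] = acc;
        }
    }
}
```

For example, with A = [[1,2,3],[4,5,6]] stored column by column as {1,4,2,5,3,6} and B = [[7,8],[9,10],[11,12]] stored as {7,9,11,8,10,12}, calling cpu_mmul(A, B, C, 2, 3, 2) fills C with {58,139,64,154}, i.e. C = [[58,64],[139,154]]. That is the kind of value I would compare the GPU output against once the module loads.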
I then compile gpu_blas_mmul.cu and check the resulting PTX with something like this:
    nvcc -ptx -gencode=arch=compute_35,code=sm_35 -lcublas gpu_blas_mmul.cu
    ptxas -arch=sm_35 gpu_blas_mmul.ptx
When I do this, I get the following error:
Unresolved extern function 'cublasCreate_v2'
If I remove __device__ from the function declaration in my script, I no longer get the error. But when I then try to load the function into CUDArt using:
    using CUDArt
    md = CuModule("path/to/gpu_blas_mmul.ptx", false)
    gpu_blas_mmul = CuFunction(md, "gpu_blas_mmul")
I get an error message:
ERROR: Named symbol not found
I have looked at this and this SO post, as well as this resource. I tried the simple solutions in them, for example using __device__ __host__ in my script and -nc when compiling with nvcc. However, I have not gone through those posts in depth. This is partly because they describe much more complex situations involving several scripts linked to each other. That seems more complicated than should be necessary here, and besides, I am not even sure it would work.
I have written other kernels and successfully compiled and launched them with CUDArt before, but this one, which relies on the cuBLAS library, has me stumped.
How can I resolve these problems so that I can compile and run the function defined in my script? I do not understand which part of this process is breaking down, or even whether I am approaching the problem at the right level.
Note: I also tried replacing __device__ with __global__ and compiling with each of the different architecture options; none of these solved the problem.