Add scalar to vector in BLAS (cuBLAS / CUDA)

I don't know if I'm just overlooking something obvious, but as far as I can tell there is no way to simply add a scalar to a vector (or matrix) using BLAS operations. I am trying to do this in cuBLAS / CUDA, so any way to do it within that framework would work for me. BLAS has <t>scal for scalar multiplication ( cublas<t>scal ), but where is the analogue for addition? I mean something like GSL's gsl_vector_add_constant . What am I missing?
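To be concrete, what I am after is the cuBLAS equivalent of this GSL call (illustrative only; the vector and values are made up):

    #include <gsl/gsl_vector.h>

    int main(void)
    {
        gsl_vector *v = gsl_vector_alloc(10);
        gsl_vector_set_all(v, 2.0);
        gsl_vector_add_constant(v, 5.0); /* every element of v becomes 7.0 */
        gsl_vector_free(v);
        return 0;
    }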

+4
2 answers

Probably the only way to do what you are asking is to apply axpy with a vector of ones of the same size, scaled by the constant you want to add.

Thus the operation becomes X <- X + alpha * I , where I is a vector whose entries are all one, which is equivalent to adding alpha to each entry of X .


EDIT:

From the comments, it seems that you are having some difficulty creating the vector of ones for the SAXPY call. One way to do it is to use a memset call to set the values of the ones vector on the device, like this:

 #include "cuda.h" #include "cuda_runtime_api.h" #include "cublas_v2.h" #include <iostream> int main(void) { const int N = 10; const size_t sz = sizeof(float) * size_t(N); float *A, *I; float Ah[N] = { 0., 1., 2., 3., 4., 5., 6., 7., 8., 9. }; cudaMalloc((void **)&A, sz); cudaMemcpy(A, &Ah[0], sz, cudaMemcpyHostToDevice); // this creates a bit pattern for a single precision unity value // and uses 32-bit memset from the driver API to set the values in the // vector. const float one = 1.0f; const int* one_bits = reinterpret_cast<const int*>(&one); cudaMalloc((void **)&I, sz); cuMemsetD32(CUdeviceptr(I), *one_bits, N); cublasHandle_t h; cublasCreate(&h); const float alpha = 5.0f; cublasSaxpy(h, N, &alpha, I, 1, A, 1); cudaMemcpy(&Ah[0], A, sz, cudaMemcpyDeviceToHost); for(int i=0; i<N; i++) { std::cout << i << " " << Ah[i] << std::endl; } cublasDestroy(h); cudaDeviceReset(); return 0; } , sz); #include "cuda.h" #include "cuda_runtime_api.h" #include "cublas_v2.h" #include <iostream> int main(void) { const int N = 10; const size_t sz = sizeof(float) * size_t(N); float *A, *I; float Ah[N] = { 0., 1., 2., 3., 4., 5., 6., 7., 8., 9. }; cudaMalloc((void **)&A, sz); cudaMemcpy(A, &Ah[0], sz, cudaMemcpyHostToDevice); // this creates a bit pattern for a single precision unity value // and uses 32-bit memset from the driver API to set the values in the // vector. const float one = 1.0f; const int* one_bits = reinterpret_cast<const int*>(&one); cudaMalloc((void **)&I, sz); cuMemsetD32(CUdeviceptr(I), *one_bits, N); cublasHandle_t h; cublasCreate(&h); const float alpha = 5.0f; cublasSaxpy(h, N, &alpha, I, 1, A, 1); cudaMemcpy(&Ah[0], A, sz, cudaMemcpyDeviceToHost); for(int i=0; i<N; i++) { std::cout << i << " " << Ah[i] << std::endl; } cublasDestroy(h); cudaDeviceReset(); return 0; } 

Note: here I allocated and copied memory for the cuBLAS vectors using the CUDA runtime API, rather than the cuBLAS helper functions (which are, in any case, very thin wrappers around the runtime API). The only "tricky" part is creating the bit pattern for a single-precision 1.0f and using the driver API function to stamp it into each 32-bit word of the array.

You could equally achieve all of this with a couple of lines of template code from the Thrust library, or just write your own kernel, which can be as simple as:

 template<typename T> __global__ void vector_add_constant( T * vector, const T scalar, int N) { int tidx = threadIdx.x + blockIdx.x*blockDim.x; int stride = blockDim.x * gridDim.x; for(; tidx < N; tidx += stride) { vector[tidx] += scalar; } } ; template<typename T> __global__ void vector_add_constant( T * vector, const T scalar, int N) { int tidx = threadIdx.x + blockIdx.x*blockDim.x; int stride = blockDim.x * gridDim.x; for(; tidx < N; tidx += stride) { vector[tidx] += scalar; } } 

[disclaimer: this kernel was written in a browser and is untested. Use at your own risk]
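For completeness, here is a minimal sketch of the Thrust version mentioned above, using thrust::transform with a placeholder expression (variable names are made up):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>
    #include <thrust/sequence.h>

    int main(void)
    {
        const int N = 10;
        const float alpha = 5.0f;

        thrust::device_vector<float> A(N);
        thrust::sequence(A.begin(), A.end());   // A = {0, 1, 2, ..., 9}

        // add alpha to every element of A in place
        using namespace thrust::placeholders;
        thrust::transform(A.begin(), A.end(), A.begin(), _1 + alpha);

        return 0;
    }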

+3

Four options, from best to worst:

  • Find the desired feature in another library
  • Implement the function you need
  • Allocate and initialize a constant vector and use it with *axpy .
  • Although zero strides are not formally supported by BLAS, some implementations treat a vector with stride 0 as a "scalar" in the sense that you want, and cuBLAS may be one of them. However, depending on this is a really bad idea (so bad that I hesitate to even mention it), since the behavior is not sanctioned by BLAS; your code will not be portable, and it may even break with future versions of the library, unless NVIDIA makes a stronger API guarantee than BLAS does. A sketch of what such a call would look like follows below.
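Purely to illustrate the (unsupported) stride-0 trick from the last option: here d_one is a hypothetical device pointer to a single float holding 1.0f, and the zero incx is exactly the part BLAS does not guarantee:

    // WARNING: incx = 0 is not sanctioned by the BLAS standard; whether this
    // adds alpha to every element of d_x is implementation-defined behavior.
    const float alpha = 5.0f;
    cublasSaxpy(handle, N, &alpha, d_one, 0, d_x, 1);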
+2
