Answering your question, "Any ideas on what I can do to speed up my GPU?"
Let me preface this with the observation that the proposed operation X = alpha * Y + beta * Z has very low arithmetic intensity: it performs only a small amount of computation per byte of data transferred. As a result, I was not able to beat the CPU time on this particular code. Nevertheless, it may be instructive to cover two ideas for speeding it up:
1. Use pinned (page-locked) host memory for the data transfers. This produced roughly a 2x reduction in data-transfer time for the GPU version, and transfer time dominates the GPU version's overall execution time.

2. Use the strided copy technique with cudaMemcpy2D, as suggested by @njuffa here. The result is twofold: we can reduce the amount of data transferred to only what the computation actually needs, and we can then rewrite the kernel to operate on the data contiguously, as suggested in the comments (again by njuffa). This gave a further ~3x improvement in data-transfer time and a ~10x improvement in kernel computation time.
This code serves as an example of these operations:
```cpp
#include <stdio.h>
#include <stdlib.h>

#define THREADS_PER_BLOCK 1024
#define DSIZE 5000000
#define WSIZE 50000
#define XSTEP 47
#define YSTEP 43
#define ZSTEP 41
#define TOL 0.00001f

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

typedef float real;

// original kernel: strided access to x, y, and z
__global__ void vectorStepAddKernel2(real *x, real *y, real *z, real alpha, real beta, int size, int xstep, int ystep, int zstep)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size)
    {
        x[i*xstep] = alpha * y[i*ystep] + beta * z[i*zstep];
    }
}

// improved kernel: contiguous access (data compacted via cudaMemcpy2D)
__global__ void vectorStepAddKernel2i(real *x, real *y, real *z, real alpha, real beta, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size)
    {
        x[i] = alpha * y[i] + beta * z[i];
    }
}

void vectorStepAdd2(real *x, real *y, real *z, real alpha, real beta, int size, int xstep, int ystep, int zstep)
{
    int threadsPerBlock = THREADS_PER_BLOCK;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    vectorStepAddKernel2<<<blocksPerGrid, threadsPerBlock>>>(x, y, z, alpha, beta, size, xstep, ystep, zstep);
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel2 fail");
}

void vectorStepAdd2i(real *x, real *y, real *z, real alpha, real beta, int size)
{
    int threadsPerBlock = THREADS_PER_BLOCK;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    vectorStepAddKernel2i<<<blocksPerGrid, threadsPerBlock>>>(x, y, z, alpha, beta, size);
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel3 fail");
}

// CPU function:
void vectorStepAdd3(real *x, real *y, real *z, real alpha, real beta, int size, int xstep, int ystep, int zstep)
{
    for (int i = 0; i < size; i++)
    {
        x[i*xstep] = alpha * y[i*ystep] + beta * z[i*zstep];
    }
}

int main()
{
    real *h_x, *h_y, *h_z, *c_x, *h_x1;
    real *d_x, *d_y, *d_z, *d_x1, *d_y1, *d_z1;
    int dsize = DSIZE;
    int wsize = WSIZE;
    int xstep = XSTEP;
    int ystep = YSTEP;
    int zstep = ZSTEP;
    real alpha = 0.5f;
    real beta = 0.5f;
    float et;

/*  pageable host allocations, retained for comparison:
    h_x = (real *)malloc(dsize*sizeof(real));
    if (h_x == 0) {printf("malloc1 fail\n"); return 1;}
    h_y = (real *)malloc(dsize*sizeof(real));
    if (h_y == 0) {printf("malloc2 fail\n"); return 1;}
    h_z = (real *)malloc(dsize*sizeof(real));
    if (h_z == 0) {printf("malloc3 fail\n"); return 1;}
    c_x = (real *)malloc(dsize*sizeof(real));
    if (c_x == 0) {printf("malloc4 fail\n"); return 1;}
    h_x1 = (real *)malloc(dsize*sizeof(real));
    if (h_x1 == 0) {printf("malloc5 fail\n"); return 1;}
*/
    // pinned (page-locked) host allocations for faster transfers
    cudaHostAlloc((void **)&h_x, dsize*sizeof(real), cudaHostAllocDefault);
    cudaCheckErrors("cudaHostAlloc 1 fail");
    cudaHostAlloc((void **)&h_y, dsize*sizeof(real), cudaHostAllocDefault);
    cudaCheckErrors("cudaHostAlloc 2 fail");
    cudaHostAlloc((void **)&h_z, dsize*sizeof(real), cudaHostAllocDefault);
    cudaCheckErrors("cudaHostAlloc 3 fail");
    cudaHostAlloc((void **)&c_x, dsize*sizeof(real), cudaHostAllocDefault);
    cudaCheckErrors("cudaHostAlloc 4 fail");
    cudaHostAlloc((void **)&h_x1, dsize*sizeof(real), cudaHostAllocDefault);
    cudaCheckErrors("cudaHostAlloc 5 fail");

    cudaMalloc((void **)&d_x, dsize*sizeof(real));
    cudaCheckErrors("cuda malloc1 fail");
    cudaMalloc((void **)&d_y, dsize*sizeof(real));
    cudaCheckErrors("cuda malloc2 fail");
    cudaMalloc((void **)&d_z, dsize*sizeof(real));
    cudaCheckErrors("cuda malloc3 fail");
    // compact device buffers: only the wsize elements actually used
    cudaMalloc((void **)&d_x1, wsize*sizeof(real));
    cudaCheckErrors("cuda malloc4 fail");
    cudaMalloc((void **)&d_y1, wsize*sizeof(real));
    cudaCheckErrors("cuda malloc5 fail");
    cudaMalloc((void **)&d_z1, wsize*sizeof(real));
    cudaCheckErrors("cuda malloc6 fail");

    for (int i = 0; i < dsize; i++) {
        h_x[i] = 0.0f;
        h_x1[i] = 0.0f;
        c_x[i] = 0.0f;
        h_y[i] = (real)(rand()/(real)RAND_MAX);
        h_z[i] = (real)(rand()/(real)RAND_MAX);
    }

    cudaEvent_t t_start, t_stop, k_start, k_stop;
    cudaEventCreate(&t_start);
    cudaEventCreate(&t_stop);
    cudaEventCreate(&k_start);
    cudaEventCreate(&k_stop);
    cudaCheckErrors("event fail");

    // first test original GPU version
    cudaEventRecord(t_start);
    cudaMemcpy(d_x, h_x, dsize * sizeof(real), cudaMemcpyHostToDevice);
    cudaCheckErrors("cuda memcpy 1 fail");
    cudaMemcpy(d_y, h_y, dsize * sizeof(real), cudaMemcpyHostToDevice);
    cudaCheckErrors("cuda memcpy 2 fail");
    cudaMemcpy(d_z, h_z, dsize * sizeof(real), cudaMemcpyHostToDevice);
    cudaCheckErrors("cuda memcpy 3 fail");
    cudaEventRecord(k_start);
    vectorStepAdd2(d_x, d_y, d_z, alpha, beta, wsize, xstep, ystep, zstep);
    cudaEventRecord(k_stop);
    cudaMemcpy(h_x, d_x, dsize * sizeof(real), cudaMemcpyDeviceToHost);
    cudaCheckErrors("cuda memcpy 4 fail");
    cudaEventRecord(t_stop);
    cudaEventSynchronize(t_stop);
    cudaEventElapsedTime(&et, t_start, t_stop);
    printf("GPU original version total elapsed time is: %f ms.\n", et);
    cudaEventElapsedTime(&et, k_start, k_stop);
    printf("GPU original kernel elapsed time is: %f ms.\n", et);

    // now test CPU version
    cudaEventRecord(t_start);
    vectorStepAdd3(c_x, h_y, h_z, alpha, beta, wsize, xstep, ystep, zstep);
    cudaEventRecord(t_stop);
    cudaEventSynchronize(t_stop);
    cudaEventElapsedTime(&et, t_start, t_stop);
    printf("CPU version total elapsed time is: %f ms.\n", et);

    for (int i = 0; i < dsize; i++)
        if (fabsf((float)(h_x[i]-c_x[i])) > TOL) {
            printf("cpu/gpu results mismatch at i = %d, cpu = %f, gpu = %f\n", i, c_x[i], h_x[i]);
            return 1;
        }

    // now test improved GPU version
    cudaEventRecord(t_start);
    // cudaMemcpy2D(d_x1, sizeof(real), h_x, xstep * sizeof(real), sizeof(real), wsize, cudaMemcpyHostToDevice);
    // cudaCheckErrors("cuda memcpy 5 fail");
    cudaMemcpy2D(d_y1, sizeof(real), h_y, ystep * sizeof(real), sizeof(real), wsize, cudaMemcpyHostToDevice);
    cudaCheckErrors("cuda memcpy 6 fail");
    cudaMemcpy2D(d_z1, sizeof(real), h_z, zstep * sizeof(real), sizeof(real), wsize, cudaMemcpyHostToDevice);
    cudaCheckErrors("cuda memcpy 7 fail");
    cudaEventRecord(k_start);
    vectorStepAdd2i(d_x1, d_y1, d_z1, alpha, beta, wsize);
    cudaEventRecord(k_stop);
    cudaMemcpy2D(h_x1, xstep*sizeof(real), d_x1, sizeof(real), sizeof(real), wsize, cudaMemcpyDeviceToHost);
    cudaCheckErrors("cuda memcpy 8 fail");
    cudaEventRecord(t_stop);
    cudaEventSynchronize(t_stop);
    cudaEventElapsedTime(&et, t_start, t_stop);
    printf("GPU improved version total elapsed time is: %f ms.\n", et);
    cudaEventElapsedTime(&et, k_start, k_stop);
    printf("GPU improved kernel elapsed time is: %f ms.\n", et);

    for (int i = 0; i < dsize; i++)
        if (fabsf((float)(h_x[i]-h_x1[i])) > TOL) {
            printf("gpu/gpu improved results mismatch at i = %d, gpu = %f, gpu imp = %f\n", i, h_x[i], h_x1[i]);
            return 1;
        }

    printf("Results: i    CPU       GPU       GPUi\n");
    for (int i = 0; i < 20*xstep; i += xstep)
        printf("  %d  %f  %f  %f\n", i, c_x[i], h_x[i], h_x1[i]);
    return 0;
}
```
As mentioned, I still could not beat the CPU time, which I attribute either to my own lack of coding skill or to the fact that this operation simply does not carry enough computational complexity to be interesting on the GPU. Still, here are some sample results:
```
GPU original version total elapsed time is: 13.352256 ms.
GPU original kernel elapsed time is: 0.195808 ms.
CPU version total elapsed time is: 2.599584 ms.
GPU improved version total elapsed time is: 4.228288 ms.
GPU improved kernel elapsed time is: 0.027392 ms.
Results: i    CPU       GPU       GPUi
  0    0.617285  0.617285  0.617285
  47   0.554522  0.554522  0.554522
  94   0.104245  0.104245  0.104245
....
```
We see that the improved code runs about 3x faster overall than the original, and almost all of that gain comes from reduced data-copy time. The copy time shrinks because, with the 2D memcpy, we transfer only the data we actually use. (Without pinned memory, these transfer times would be roughly twice as long.) We can also see that the kernel computation is about 10x faster than the CPU calculation for the original kernel, and about 100x faster for the improved kernel. Nevertheless, once data-transfer time is included, we cannot overcome the CPU's speed.
One final comment: the "cost" of the cudaMemcpy2D operation is still quite high. For a 100x reduction in vector size, we see only a 3x reduction in copy time, so strided access remains a relatively expensive way to use the GPU. If we were simply transferring vectors of 50,000 contiguous elements, we would expect a nearly linear, 100x reduction in copy time (compared to the original copies of 5,000,000-element vectors). That would bring the copy time under 1 ms, and our GPU version would then be faster than the CPU, or at least faster than this naive single-threaded CPU code.