In CUDA, loading to shared memory is slower than loading to registers

I am not an experienced CUDA programmer and I have run into the following problem. I am trying to load a 32x32 tile of a large matrix (10K x 10K) from global memory into shared memory, and I synchronize after the load. I noticed that if I instead load it into private memory (registers), the load runs 4-5 times faster than the load into shared memory.

    __global__ void speedtest(float *vel, int nx)
    {
        int globalx = blockDim.x * blockIdx.x + threadIdx.x + pad;
        int globalz = blockDim.y * blockIdx.y + threadIdx.y + pad;
        int localx = threadIdx.x;
        int localz = threadIdx.y;
        float ptest;
        __shared__ float stest[tile][tile];

        //stest[localz][localx] = vel[globalz*nx + globalx]; // load to shared memory
        ptest = vel[globalz*nx + globalx];                   // load to private memory
        __syncthreads();
    }

I comment out stest and ptest one at a time and measure the elapsed time with cudaEventRecord. The stest version takes 3.2 ms and the ptest version takes 0.75 ms. What am I doing wrong? Shouldn't the times be very similar? What am I missing?

Configuration: CUDA 7.5, GTX 980, only 32-bit variables and calculations. There is no specific goal; I am just playing with it.

Here is the full sample code:

    #include <stdio.h>
    #include <math.h>
    #include <helper_cuda.h>
    #define tile 32

    void makeittwo(float *array, int nz, int nx)
    {
        // this just assigns a constant to every element of the vector
        int n2 = nx*nz;
        for (int i = 0; i < n2; i++)
            array[i] = 2000;
    }

    __global__ void speedtest(float *vel, int nx, int nz)
    {
        int globalx = blockDim.x * blockIdx.x + threadIdx.x;
        int globalz = blockDim.y * blockIdx.y + threadIdx.y;
        int localx = threadIdx.x;
        int localz = threadIdx.y;
        float ptest; // declarations
        __shared__ float stest[tile][tile];

        if (globalx < nx && globalz < nz) {
            stest[localz][localx] = vel[globalz*nx + globalx]; // shared variable
            //ptest = vel[globalz*nx + globalx];               // private variable
            // comment out ptest and stest one at a time to test them
        }
        __syncthreads();
    }

    int main(int argc, char **argv)
    {
        int nx, nz, N;
        float *vel;
        nz = 10000; nx = 10000; // matrix dimensions
        N = nz*nx;              // the matrix is stored as a flat vector

        checkCudaErrors(cudaMallocHost(&vel, sizeof(float)*N)); // using pinned memory
        makeittwo(vel, nz, nx);

        dim3 dimBlock(tile, tile);
        dim3 dimGrid;
        int blockx = dimBlock.x;
        int blockz = dimBlock.y;
        dimGrid.x = (nx + blockx - 1) / blockx;
        dimGrid.y = (nz + blockz - 1) / blockz;

        float *d_vel;
        checkCudaErrors(cudaMalloc(&d_vel, sizeof(float)*N));
        // copying to device
        checkCudaErrors(cudaMemcpy(d_vel, vel, sizeof(float)*N, cudaMemcpyHostToDevice));

        cudaEvent_t start, stop;
        float elapsedTime;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        speedtest<<<dimGrid, dimBlock>>>(d_vel, nx, nz); // calling the kernel
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsedTime, start, stop);
        printf("time=%3.3f ms\n", elapsedTime);

        // copying the matrix back to check that all went well
        // (this fails if out-of-bounds accesses are made)
        checkCudaErrors(cudaMemcpy(vel, d_vel, sizeof(float)*N, cudaMemcpyDeviceToHost));
        cudaDeviceReset();
    }
1 answer

The sample code does not actually measure what the OP expects, because some of the instructions are optimized away by the compiler.

In the example, loading into a local variable (ptest) does not affect any state outside the kernel. In this case, the compiler is free to remove the load entirely. This can be seen in the SASS code: the SASS is identical whether ptest=vel[globalz*nx+globalx]; is active or both statements (ptest and stest) are commented out. To inspect the SASS code, you can run cuobjdump --dump-sass on the object file.

The store to shared memory apparently is not optimized away, which can again be checked in the SASS code. (In fact, I expected these instructions to be deleted as well. Are there side effects I am missing?)

As already discussed in the comments, with a simple calculation (ptest *= ptest) and a write back to global memory, the compiler cannot delete the instructions, because they change the global state.
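A minimal sketch of such a kernel (my own variant of the OP's code, not the exact code from the comments): because the squared value is written back to vel, the register load can no longer be eliminated as dead code.

    __global__ void speedtest_private(float *vel, int nx, int nz)
    {
        int globalx = blockDim.x * blockIdx.x + threadIdx.x;
        int globalz = blockDim.y * blockIdx.y + threadIdx.y;

        if (globalx < nx && globalz < nz) {
            float ptest = vel[globalz*nx + globalx]; // load to a register
            ptest *= ptest;                          // calculation
            vel[globalz*nx + globalx] = ptest;       // store back: changes global state,
                                                     // so the load survives optimization
        }
    }

With a kernel like this, timing the shared and private variants against each other compares two loads that the compiler actually emits.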

From the OP's comments, I assume there is a misunderstanding of how a load into shared memory works. In fact, data is loaded from global memory into registers first, and then stored from the registers into shared memory. The generated (corresponding) SASS instructions (for sm_30) look like this:

    LD.E R2, [R6]; // load to register R2
    STS [R0], R2;  // store from register R2 to shared memory
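In CUDA C terms, the single source line therefore behaves like an explicit two-step copy through a register. A sketch (my own illustration, not the OP's code):

    #define tile 32

    __global__ void load_to_shared(const float *vel, int nx)
    {
        __shared__ float stest[tile][tile];
        int localx  = threadIdx.x;
        int localz  = threadIdx.y;
        int globalx = blockDim.x * blockIdx.x + localx;
        int globalz = blockDim.y * blockIdx.y + localz;

        // one source line, two SASS instructions:
        float tmp = vel[globalz*nx + globalx]; // LD.E: global memory -> register
        stest[localz][localx] = tmp;           // STS:  register -> shared memory
        __syncthreads();
        // ... use stest ...
    }

So the shared-memory version always does at least as much work as the register version; the register load is a subset of it.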

The following example, a multiply followed by a store to global memory, demonstrates another case where the compiler does not generate the code you might naively expect:

    stest[localz][localx] = vel[globalz*nx + globalx]; // load to shared memory
    stest[localz][localx] *= stest[localz][localx];    // multiply
    vel[globalz*nx + globalx] = stest[localz][localx]; // save to global memory

The SASS code shows that the value is only stored to shared memory after the calculation, and that shared memory is never read back.

    LD.E R2, [R6];   // load to register
    FMUL R0, R2, R2; // multiply
    STS [R3], R0;    // store the result in shared memory
    ST.E [R6], R0;   // store the result in global memory

I am not an expert in SASS code; please correct me if I am wrong or have left out something important.

