The difference between a program that uses constant memory and one that uses global memory

I have two programs. The only difference is that one uses constant memory while the other uses global memory. I want to know why the global memory version is faster than the constant memory version. Both of them compute the dot product of two vectors, by the way.

#include <cuda_runtime.h>
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define intMin(a,b) ((a<b)?a:b)
//Threads per block
#define TPB 128
//blocks per grid
#define BPG intMin(128, ((n+TPB-1)/TPB))

const int n = 4;
__constant__ float deva[n], devb[n];

__global__ void addVal(float *c){
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    //Using shared memory to temporarily store results
    __shared__ float cache[TPB];
    float temp = 0;
    while(tid < n){
        temp += deva[tid] * devb[tid];
        tid += gridDim.x * blockDim.x;
    }
    cache[threadIdx.x] = temp;
    __syncthreads();

    int i = blockDim.x/2;
    while(i != 0){
        if(threadIdx.x < i){
            cache[threadIdx.x] = cache[threadIdx.x] + cache[threadIdx.x + i];
        }
        __syncthreads();
        i = i/2;
    }
    if(threadIdx.x == 1){
        c[blockIdx.x] = cache[0];
    }
}

int main(){
    float a[n], b[n], c[BPG];
    //float *deva, *devb, *devc;
    float *devc;
    int i;

    //Filling with random values to test
    for(i = 0; i < n; i++){
        a[i] = i;
        b[i] = i*2;
    }

    //cudaMalloc((void**)&deva, n * sizeof(float));
    //cudaMalloc((void**)&devb, n * sizeof(float));
    cudaMalloc((void**)&devc, BPG * sizeof(float));
    //cudaMemcpy(deva, a, n * sizeof(float), cudaMemcpyHostToDevice);
    //cudaMemcpy(devb, b, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(deva, a, n * sizeof(float));
    cudaMemcpyToSymbol(devb, b, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    //Call kernel to do the dot product
    addVal<<<BPG, TPB>>>(devc);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float time;
    cudaEventElapsedTime(&time, start, stop);
    printf("The elapsed time is: %f\n", time);

    //copy result back
    cudaMemcpy(c, devc, BPG * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0;
    for(i = 0; i < BPG; i++){
        sum += c[i];
    }
    //display answer
    printf("%f\n", sum);
    getchar();
    return 0;
}

The following is the global memory version.

#include <cuda_runtime.h>
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define intMin(a,b) ((a<b)?a:b)
//Threads per block
#define TPB 128
//blocks per grid
#define BPG intMin(128, ((n+TPB-1)/TPB))

const int n = 4;

__global__ void addVal(float *a, float *b, float *c){
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    //Using shared memory to temporarily store results
    __shared__ float cache[TPB];
    float temp = 0;
    while(tid < n){
        temp += a[tid] * b[tid];
        tid += gridDim.x * blockDim.x;
    }
    cache[threadIdx.x] = temp;
    __syncthreads();

    int i = blockDim.x/2;
    while(i != 0){
        if(threadIdx.x < i){
            cache[threadIdx.x] = cache[threadIdx.x] + cache[threadIdx.x + i];
        }
        __syncthreads();
        i = i/2;
    }
    if(threadIdx.x == 1){
        c[blockIdx.x] = cache[0];
    }
}

int main(){
    float a[n], b[n], c[BPG];
    float *deva, *devb, *devc;
    int i;

    //Filling with random values to test
    for(i = 0; i < n; i++){
        a[i] = i;
        b[i] = i*2;
    }

    printf("Not using constant memory\n");
    cudaMalloc((void**)&deva, n * sizeof(float));
    cudaMalloc((void**)&devb, n * sizeof(float));
    cudaMalloc((void**)&devc, BPG * sizeof(float));
    cudaMemcpy(deva, a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devb, b, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    //Call kernel to do the dot product
    addVal<<<BPG, TPB>>>(deva, devb, devc);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float time;
    cudaEventElapsedTime(&time, start, stop);
    printf("The elapsed time is: %f\n", time);

    //copy result back
    cudaMemcpy(c, devc, BPG * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0;
    for(i = 0; i < BPG; i++){
        sum += c[i];
    }
    //display answer
    printf("%f\n", sum);
    getchar();
    return 0;
}
1 answer

You are not using constant memory in a way that plays to its strengths.

  • A single read from constant memory can be broadcast to an entire half-warp (this does not happen in your case, since each thread loads from its own tid).
  • Constant memory is cached (the cache is of no use in your case, since you read each position of the constant array only once).
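Constant memory pays off when every thread in a warp reads the same element at the same time, so one load is broadcast to the whole warp. A minimal sketch of such an access pattern (the kernel and names here are illustrative, not from the question's code):

```cuda
#include <cuda_runtime.h>

#define DEGREE 8
// Coefficients shared by all threads: a good fit for constant memory.
__constant__ float coef[DEGREE];

__global__ void polyEval(const float *x, float *y, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float acc = 0.0f;
        // k is uniform across the warp, so coef[k] is a single
        // broadcast load per iteration, not 16 serialized reads.
        for (int k = DEGREE - 1; k >= 0; --k)
            acc = acc * x[tid] + coef[k];
        y[tid] = acc;
    }
}
```

In the question's kernels, by contrast, each thread indexes the arrays by its own tid, which is the worst case for constant memory and the best case for coalesced global memory.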

Since each thread in the half-warp does one read of different data, the 16 different reads are serialized, taking 16 times as long to service the request.

When they read from global memory instead, the requests are serviced simultaneously and coalesced into a single transaction. That is why your global memory example is faster than the constant memory one.

Of course, this conclusion may vary on devices of compute capability 2.x, which have L1 and L2 caches.

