Memory Sharing Analysis of My CUDA Kernel

I would like to read (BS_X + 1) * (BS_Y + 1) global memory locations with BS_X * BS_Y threads, copying the contents to shared memory, and I developed the following code.

    int i = threadIdx.x;
    int j = threadIdx.y;
    int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
    int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

    // linear thread index, remapped to (i1, j1) and (i2, j2) inside the
    // (BLOCK_SIZE_X+1) x (BLOCK_SIZE_Y+1) tile
    int index1 = j*BLOCK_SIZE_Y+i;

    int i1 = (index1)%(BLOCK_SIZE_X+1);
    int j1 = (index1)/(BLOCK_SIZE_Y+1);

    int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
    int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);

    __shared__ double Ezx_h_shared_ext[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];

    // first load: every thread copies one element of the enlarged tile
    Ezx_h_shared_ext[i1][j1] = Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim + (blockIdx.x*BLOCK_SIZE_X+i1)];

    // second load: the remaining elements of the enlarged tile
    if ((i2 < (BLOCK_SIZE_X+1)) && (j2 < (BLOCK_SIZE_Y+1)))
        Ezx_h_shared_ext[i2][j2] = Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j2)*xdim + (blockIdx.x*BLOCK_SIZE_X+i2)];

In my understanding, coalescing is the parallel equivalent of consecutive memory reads in sequential processing. How can I check whether the global memory accesses are coalesced? I note that there is an index transformation from (i1, j1) to (i2, j2). Thanks in advance.

+6
cuda shared
3 answers

I evaluated the memory accesses of your code with a hand-written coalescing analyzer. The evaluation shows that the code exploits coalescing poorly. Here is the coalescing analyzer, which you may find useful:

    #include <stdio.h>
    #include <malloc.h>

    typedef struct dim3_t {
        int x;
        int y;
    } dim3;

    // KERNEL LAUNCH PARAMETERS
    #define GRIDDIMX 4
    #define GRIDDIMY 4
    #define BLOCKDIMX 16
    #define BLOCKDIMY 16

    // ARCHITECTURE DEPENDENT
    // number of threads aggregated for coalescing
    #define COALESCINGWIDTH 32
    // number of bytes in one coalesced transaction
    #define CACHEBLOCKSIZE 128
    #define CACHE_BLOCK_ADDR(addr,size) (addr*size)&(~(CACHEBLOCKSIZE-1))

    int main(){
        // fixed dim3 variables
        // grid and block size
        dim3 blockDim, gridDim;
        blockDim.x = BLOCKDIMX;
        blockDim.y = BLOCKDIMY;
        gridDim.x  = GRIDDIMX;
        gridDim.y  = GRIDDIMY;

        // counters
        int unq_accesses = 0;
        int *unq_addr = (int*)malloc(sizeof(int)*COALESCINGWIDTH);
        int total_unq_accesses = 0;

        // iter over total number of threads
        // and count the number of memory requests (the coalesced requests)
        int I, II, III;
        for(I=0; I<GRIDDIMX*GRIDDIMY; I++){
            dim3 blockIdx;
            blockIdx.x = I%GRIDDIMX;
            blockIdx.y = I/GRIDDIMX;

            for(II=0; II<BLOCKDIMX*BLOCKDIMY; II++){
                if(II%COALESCINGWIDTH==0){
                    // new coalescing bunch
                    total_unq_accesses += unq_accesses;
                    unq_accesses = 0;
                }
                dim3 threadIdx;
                threadIdx.x = II%BLOCKDIMX;
                threadIdx.y = II/BLOCKDIMX;

                ////////////////////////////////////////////////////////
                // Change this section to evaluate different accesses //
                ////////////////////////////////////////////////////////
                // do your indexing here
                #define BLOCK_SIZE_X BLOCKDIMX
                #define BLOCK_SIZE_Y BLOCKDIMY
                #define xdim 32
                int i = threadIdx.x;
                int j = threadIdx.y;
                int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
                int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

                int index1 = j*BLOCK_SIZE_Y+i;

                int i1 = (index1)%(BLOCK_SIZE_X+1);
                int j1 = (index1)/(BLOCK_SIZE_Y+1);
                int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
                int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);

                // calculate the accessed location and offset here
                // change the line "Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];" to
                int addr = (blockIdx.y*BLOCK_SIZE_Y+j1)*xdim + (blockIdx.x*BLOCK_SIZE_X+i1);
                int size = sizeof(double);
                //////////////////////////
                // End of modifications //
                //////////////////////////

                printf("tid (%d,%d) from blockid (%d,%d) accessing to block %d\n",
                       threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y, CACHE_BLOCK_ADDR(addr,size));

                // check whether it can be merged with existing requests or not
                short merged = 0;
                for(III=0; III<unq_accesses; III++){
                    if(CACHE_BLOCK_ADDR(addr,size)==CACHE_BLOCK_ADDR(unq_addr[III],size)){
                        merged = 1;
                        break;
                    }
                }
                if(!merged){
                    // new cache block accessed over this coalescing width
                    unq_addr[unq_accesses] = CACHE_BLOCK_ADDR(addr,size);
                    unq_accesses++;
                }
            }
        }
        printf("%d threads make %d memory transactions\n",
               GRIDDIMX*GRIDDIMY*BLOCKDIMX*BLOCKDIMY, total_unq_accesses);
    }

The code iterates over every thread of the grid and counts the number of coalesced requests, which is the metric for how well the memory accesses coalesce.
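
The analyzer is plain host-side C, so it can be built and run without a GPU. Assuming you saved it as coalescing_analyzer.c (an illustrative file name), something like the following prints one line per simulated thread followed by the transaction summary:

    gcc coalescing_analyzer.c -o coalescing_analyzer
    ./coalescing_analyzer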

To use the analyzer, paste the index-calculation portion of your kernel into the designated section and decompose the memory access (the array reference) into an "address" and a "size". I have already done this for your code, where the indexing is:

    int i = threadIdx.x;
    int j = threadIdx.y;
    int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
    int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

    int index1 = j*BLOCK_SIZE_Y+i;

    int i1 = (index1)%(BLOCK_SIZE_X+1);
    int j1 = (index1)/(BLOCK_SIZE_Y+1);
    int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
    int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);

and the memory access is:

 Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)]; 

The analyzer reports that the 4096 threads of this configuration access 4064 cache blocks. Run the code for your actual grid and block sizes and analyze the coalescing behavior.
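
Taking the reported figures at face value, 4064 transactions issued by 4096 threads corresponds to 4064 / (4096 / 32) ≈ 31.75 transactions per warp-sized group. For comparison, a fully coalesced, 128-byte-aligned warp of 8-byte double loads needs only 2 transactions (32 threads × 8 bytes = 256 bytes = two 128-byte transactions), so there appears to be considerable room for improvement.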

+5

As GPUs have evolved, the requirements for coalesced accesses have become less restrictive. Your description of coalesced access is more accurate for earlier GPU architectures than for more recent ones. In particular, Fermi (compute capability 2.0) significantly loosened the requirements. On Fermi and later, it is not important to access memory locations consecutively. Instead, the focus has shifted to accessing memory with as few memory transactions as possible. On Fermi, global memory transactions are 128 bytes wide. So, when the 32 threads of a warp hit an instruction that performs a load or store, 128-byte transactions are scheduled to service all the threads of the warp. Performance then depends on how many transactions are necessary. If all the threads access values within a 128-byte region that is aligned to 128 bytes, a single transaction suffices. If all the threads access values in different 128-byte regions, 32 transactions are required. That is the worst case for servicing the requests of a single instruction in a warp.
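
To make the two extremes concrete, here is a minimal sketch assuming 4-byte float elements and a 1D thread block; the kernel names and the stride of 32 are illustrative and not from the original post. In the first kernel, the 32 threads of a warp read 32 consecutive floats (128 contiguous bytes), which an aligned 128-byte transaction can service at once; in the second, every thread of the warp lands in a different 128-byte segment, so 32 transactions are needed.

    // fully coalesced: lane k of a warp reads element base+k, i.e. 32
    // consecutive floats = 128 contiguous bytes -> one transaction per warp
    __global__ void coalesced_copy(const float *in, float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = in[tid];
    }

    // worst case: lane k reads element 32*k, so each of the 32 lanes falls
    // into a different 128-byte segment -> 32 transactions per warp
    __global__ void strided_copy(const float *in, float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        out[tid] = in[tid * 32];
    }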

Use one of the CUDA profilers to determine the average number of transactions required to service the requests. The number should be as close to 1 as possible; higher numbers mean you should check whether there is an opportunity to optimize the memory accesses in your kernel.
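
For example, with the command-line profiler shipped with the CUDA toolkit you can query the global load/store transactions-per-request counters; treat this as a sketch, since the exact metric names depend on the toolkit version and device:

    nvprof --metrics gld_transactions_per_request,gst_transactions_per_request ./your_app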

+2

The Visual Profiler is a great tool for checking your work. Once you have a functionally correct piece of code, run it from within the Visual Profiler. On Linux, for example, assuming you have an X session, just run nvvp from a terminal window. You will then be presented with a wizard that prompts you for the application to profile along with any command-line parameters.

The profiler will then do a basic run of your application to collect statistics. You can also select more advanced statistics gathering (some of which require additional runs), and one of these is the memory utilization statistics. It reports memory utilization as a percentage of peak and will also flag what it considers to be low utilization that deserves your attention.

If you have a utilization number above 50%, your application is probably running the way you expect. If the number is low, you have probably missed some coalescing details. The statistics are reported separately for memory reads and memory writes. To get to 100% or close to it, you will also need to ensure that your coalesced reads and writes from the warp are aligned on 128-byte boundaries.

A common mistake in these situations is to use a threadIdx.y-based quantity as the most rapidly varying index. It doesn't seem to me that you've made that error. For example, it's a common mistake to write shared[threadIdx.x][threadIdx.y], because that is often the way we think about it in C. But threads are grouped together first along the x axis, so we want to use shared[threadIdx.y][threadIdx.x] or something similar. If you make this mistake, your code can still be functionally correct, but you will get low percentage utilization numbers in the profiler, such as around 12% or even 3%.
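
Here is a minimal sketch of the two orderings, assuming a 16x16 thread block; the kernel names, the float data, and the width parameter are illustrative assumptions, not taken from the original code. Both kernels copy the same data; the only difference is which thread index varies fastest in the global load.

    // poor: consecutive lanes of a warp differ in threadIdx.x, so they read
    // elements that are `width` floats apart -> badly coalesced global loads
    __global__ void load_transposed(const float *in, float *out, int width)
    {
        __shared__ float tile[16][16];
        tile[threadIdx.x][threadIdx.y] = in[threadIdx.x * width + threadIdx.y];
        __syncthreads();
        out[threadIdx.y * width + threadIdx.x] = tile[threadIdx.y][threadIdx.x];
    }

    // good: consecutive lanes read consecutive elements of a row -> coalesced
    __global__ void load_rowwise(const float *in, float *out, int width)
    {
        __shared__ float tile[16][16];
        tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * width + threadIdx.x];
        __syncthreads();
        out[threadIdx.y * width + threadIdx.x] = tile[threadIdx.y][threadIdx.x];
    }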

As already mentioned, to get above 50% and closer to 100%, you will want to make sure that not only are all your thread requests adjacent, but that they are also aligned on a 128-byte boundary. Because of the L1/L2 caches, these are not hard-and-fast rules but guidelines; the caches can mitigate some mistakes to some extent.
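
One common way to obtain that alignment for 2D data is to let the runtime pad each row, e.g. with cudaMallocPitch. A minimal host-side sketch follows; the array name and dimensions are illustrative, not the original Ezx_h sizes.

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        double *d_field = NULL;
        size_t pitch = 0;                      // row stride in bytes, chosen by the runtime
        int width = 100, height = 100;         // illustrative logical dimensions

        // the runtime pads each row so that every row starts on an alignment
        // boundary; pitch is the padded row size in bytes
        cudaError_t err = cudaMallocPitch((void **)&d_field, &pitch,
                                          width * sizeof(double), height);
        if (err != cudaSuccess) {
            printf("cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("requested %zu bytes per row, got pitch = %zu bytes\n",
               width * sizeof(double), pitch);

        cudaFree(d_field);
        return 0;
    }

Inside a kernel, row r of such an allocation starts at (double *)((char *)d_field + r * pitch), so each row begins on a boundary suitable for aligned, coalesced transactions.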

+1
