Loading data into shared memory for the convolution kernel

I am having some difficulty understanding the loading of the batches, as described in the comments of the code. To compute the convolution at a pixel, a 5x5 mask must be centered on that pixel. The image is divided into tiles; after applying the convolution mask, these become the final output tiles of size TILE_WIDTH*TILE_WIDTH. For pixels on the border of an output tile, the mask must borrow some pixels from the adjacent tiles; when the tile lies on the border of the image, so that there is no neighbouring pixel, the borrowed values are set to zero. These two steps are implemented by

    if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
        N_ds[destY][destX] = I[src];
    else
        N_ds[destY][destX] = 0;

For this reason, the shared memory array has a size of TILE_WIDTH + Mask_width - 1 on each side (20x20 with TILE_WIDTH = 16 and Mask_width = 5). The following parts of the code are unclear to me.

  • The indices destY and destX: what does dividing the flattened output index by the width of the input tile mean?
  • The indices srcY and srcX: why do destY and destX take part in computing srcY and srcX?

    srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;

    srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;

  • Why do we use the offset TILE_WIDTH * TILE_WIDTH in the second load?
  • Generally, what is an intuitive explanation for having two loads?
  • Could all of these questions be walked through with an intuitive example based on the image below?
  • Thanks!

EDIT: Image added. The output tiles are shown in green, and in red is a mask centered at index 114. Obviously, the mask borrows elements from neighbouring tiles. Finally, this image refers to a single channel.

Example: Based on the image below, I tried to work through an example. The output tile has blockIdx.x=1 and blockIdx.y=1, and I take destY=0 and destX=0. Then srcY = 1*6+0-3 = 3, srcX = 3 and src = (3*18+3)*3+0 = 171. Comparing this calculation with the sample image, there is no match: according to the image, the first batch load should store the value whose global index is 57. What is wrong with the above calculations? Can anyone help?

[image: the input image, with output tiles in green and the mask centered at index 114 in red]

    #define Mask_width 5
    #define Mask_radius Mask_width/2
    #define TILE_WIDTH 16
    #define w (TILE_WIDTH + Mask_width - 1)
    #define clamp(x) (min(max((x), 0.0), 1.0))

    __global__ void convolution(float *I, const float* __restrict__ M, float *P,
                                int channels, int width, int height) {
        __shared__ float N_ds[w][w];
        int k;
        for (k = 0; k < channels; k++) {
            // First batch loading: the TILE_WIDTH*TILE_WIDTH threads of the block
            // fill the first TILE_WIDTH*TILE_WIDTH locations of the w x w shared tile
            int dest = threadIdx.y * TILE_WIDTH + threadIdx.x,
                destY = dest / w, destX = dest % w,
                srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius,
                srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius,
                src = (srcY * width + srcX) * channels + k;
            if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
                N_ds[destY][destX] = I[src];
            else
                N_ds[destY][destX] = 0;

            // Second batch loading: the remaining locations of the shared tile
            dest = threadIdx.y * TILE_WIDTH + threadIdx.x + TILE_WIDTH * TILE_WIDTH;
            destY = dest / w, destX = dest % w;
            srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
            srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
            src = (srcY * width + srcX) * channels + k;
            if (destY < w) {
                if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
                    N_ds[destY][destX] = I[src];
                else
                    N_ds[destY][destX] = 0;
            }
            __syncthreads();

            // Each thread convolves the mask with the shared tile and writes one output pixel
            float accum = 0;
            int y, x;
            for (y = 0; y < Mask_width; y++)
                for (x = 0; x < Mask_width; x++)
                    accum += N_ds[threadIdx.y + y][threadIdx.x + x] * M[y * Mask_width + x];
            y = blockIdx.y * TILE_WIDTH + threadIdx.y;
            x = blockIdx.x * TILE_WIDTH + threadIdx.x;
            if (y < height && x < width)
                P[(y * width + x) * channels + k] = clamp(accum);
            __syncthreads();
        }
    }
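For completeness, a minimal host-side launch sketch consistent with the indexing above (the names d_I, d_M and d_P are placeholders for device buffers assumed to be allocated and filled elsewhere; this is not part of the original assignment code):

    // Hypothetical host code: d_I, d_M, d_P, width, height and channels are
    // assumed to be set up already; one thread is launched per output pixel.
    dim3 block(TILE_WIDTH, TILE_WIDTH);                    // 16 x 16 = 256 threads
    dim3 grid((width  + TILE_WIDTH - 1) / TILE_WIDTH,
              (height + TILE_WIDTH - 1) / TILE_WIDTH);     // enough tiles to cover the image
    convolution<<<grid, block>>>(d_I, d_M, d_P, channels, width, height);
    cudaDeviceSynchronize();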
1 answer

Your question is similar to my first question on StackOverflow: Moving a (BS_X+1)(BS_Y+1) global memory matrix by BS_X*BS_Y threads.

You have encountered the following problem: each thread block of size TILE_WIDTHxTILE_WIDTH must fill a shared memory area of size (TILE_WIDTH + Mask_width - 1)x(TILE_WIDTH + Mask_width - 1).

4) Generally, what is an intuitive explanation for having two loads?

Since the shared memory area (TILE_WIDTH + Mask_width - 1)x(TILE_WIDTH + Mask_width - 1) is larger than the block size TILE_WIDTHxTILE_WIDTH, but smaller than 2xTILE_WIDTHxTILE_WIDTH, each thread has to move at most two elements from global memory to shared memory. This is why you have a two-stage load.
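With the values in the code this works out as follows: the shared tile is (16 + 5 - 1) x (16 + 5 - 1) = 20 x 20 = 400 elements, while the block has 16 x 16 = 256 threads; since 256 < 400 <= 2 x 256 = 512, every thread performs one load in the first stage and some of the threads perform one more load in the second stage.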

1) The indices destY and destX: what does dividing the output index by the width of the input tile mean?

This concerns the first loading stage, which is in charge of loading TILE_WIDTHxTILE_WIDTH elements from global memory and fills the uppermost part of the shared memory area.

So the operation

    dest = threadIdx.y * TILE_WIDTH + threadIdx.x;

flattens the 2D coordinates of the generic thread, while

    destX = dest % w;
    destY = dest / w;

performs the inverse operation, calculating the 2D coordinates of the generic thread with respect to the shared memory area.
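For example, with TILE_WIDTH = 16 and w = 20, the thread with threadIdx.y = 1 and threadIdx.x = 3 gets dest = 1 * 16 + 3 = 19, and therefore destY = 19 / 20 = 0 and destX = 19 % 20 = 19: although the thread sits in the second row of the block, it fills the last element of the first row of N_ds.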

2) The indices srcY and srcX: why do destY and destX take part in computing srcY and srcX?

    srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
    srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;

(blockIdx.x * TILE_WIDTH, blockIdx.y * TILE_WIDTH) would be the coordinates of the global memory location if the block size and the shared memory size were the same. Since you also "borrow" memory values from neighbouring tiles, you have to shift the above coordinates by (destX - Mask_radius, destY - Mask_radius).
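Continuing the example above: for the block with blockIdx.x = blockIdx.y = 0, the thread with dest = 0 has destY = destX = 0, so srcY = 0*16 + 0 - 2 = -2 and srcX = -2 (with Mask_radius = 5/2 = 2). That location lies outside the image, so the boundary check stores 0 instead; for threads whose (srcX, srcY) fall inside the image, the load reads the corresponding halo pixel of the neighbouring tile.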

3) Why do we use the offset TILE_WIDTH * TILE_WIDTH in the second load?

You have this offset because in the first loading stage you have already filled the "first" TILE_WIDTHxTILE_WIDTH locations of shared memory.
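Numerically: in the second stage dest runs from 256 (for the thread with threadIdx.x = threadIdx.y = 0) up to 511, so destY = dest / 20 ranges from 12 to 25; the check destY < w discards the threads whose dest would exceed the last shared memory location (index 399), and the remaining threads fill locations 256 to 399.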

EDIT

The following figure shows the correspondence between the flattened thread index dest and the shared memory locations. In the figure, the blue boxes represent the elements of the generic tile, while the red boxes represent the elements of the neighbouring tiles. The union of the blue and red boxes corresponds to the overall shared memory locations. As you can see, all 256 threads of the thread block are involved in filling the upper part of the shared memory above the green line, while only 145 are involved in filling the lower part below the green line. You should now also be able to understand the TILE_WIDTH x TILE_WIDTH offset.

Please note that you have at most 2 global memory loads per thread only because of the particular choice of your parameters. For example, if you had TILE_WIDTH = 8, then the number of threads in the thread block would be 64, while the shared memory size would be 12x12 = 144, which means that each thread would be in charge of at least 2 shared memory writes since 144/64 = 2.25 (so some threads would need a third load, and two loading stages would no longer be enough).

[figure: correspondence between the flattened thread index dest and the shared memory locations, with blue boxes for elements of the tile itself, red boxes for elements borrowed from neighbouring tiles, and the green line separating the two loading stages]
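As a quick check of this last remark, here is a small standalone snippet (not part of the kernel; the helper stages_needed is introduced here only for illustration) that computes how many loading stages each parameter choice would require:

    #include <stdio.h>

    /* Standalone check of the parameter discussion above; Mask_width = 5 as in the kernel. */
    static int stages_needed(int tile_width, int mask_width) {
        int w_loc    = tile_width + mask_width - 1;     /* side of the shared tile   */
        int threads  = tile_width * tile_width;         /* threads per block         */
        int elements = w_loc * w_loc;                   /* shared elements to load   */
        return (elements + threads - 1) / threads;      /* ceil(elements / threads)  */
    }

    int main(void) {
        printf("TILE_WIDTH = 16 -> %d loading stages\n", stages_needed(16, 5)); /* 2 */
        printf("TILE_WIDTH =  8 -> %d loading stages\n", stages_needed(8, 5));  /* 3 */
        return 0;
    }

With TILE_WIDTH = 16 the two stages of the kernel are enough, whereas TILE_WIDTH = 8 would require a third one.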

