Your question is similar in my first question about StackOverflow: Moving the global memory matrix (BS_X + 1) (BS_Y + 1) across BS_XBS_Y streams .
You have encountered the following problem: each stream block of size TILE_WIDTHxTILE_WIDTH should fill the shared memory area with the size (TILE_WIDTH + Mask_width - 1)x(TILE_WIDTH + Mask_width - 1) .
4) As a rule, what is an intuitive explanation for the presence of two loads?
Since the shared memory area (TILE_WIDTH + Mask_width - 1)x(TILE_WIDTH + Mask_width - 1) larger than the block size TILE_WIDTHxTILE_WIDTH and provided that it is less than 2xTILE_WIDTHxTILE_WIDTH , then each stream must move no more than two elements from global memory to shared memory . It is for this reason that you have a two-step download.
1) Index destY and destX . Dividing the output index by the width of the input tile, what does it mean?
This concerns the first loading stage, which is assigned to load TILE_WIDTHxTILE_WIDTH elements from global memory and fills the very top of the shared memory area.
So the operation
dest = threadIdx.y * TILE_WIDTH + threadIdx.x;
aligns the 2D coordinates of the overall stream, and
destX = dest % w; destY = dest / w;
performs the inverse operation, because it calculates the 2D coordinates of the total stream with respect to the shared memory area.
2) Index srcY add srcX . Why do destY and destX index participate in srcY add srcX index?
srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius; srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
(blockIdx.x * TILE_WIDTH, blockIdx.y * TILE_WIDTH) will be the coordinates of the global memory cell if the block size and shared memory size were the same. Since you also βborrowβ memory values ββfrom neighboring fragments, you need to shift the above coordinates by (destX - Mask_radius, destY - Mask_radius) .
3) Why do we use the offset TILE_WIDTH * TILE_WIDTH in the second boot?
You have this offset, because in the first memory step you already filled in the "first" TILE_WIDTHxTILE_WIDTH the shared memory location.
EDIT
The following figure shows the correspondence between the index of smoothed threads dest and the locations of shared memory. In the figure, blue squares represent elements of a common tile, and red squares represent elements of adjacent tiles. The combination of blue and red boxes corresponds to the common places of common memory. As you can see, all threads 256 of the stream block are involved in filling the upper part of the shared memory above the green line, while only 145 involved in filling the lower part of the shared memory under the green line. You should now also understand the offset TILE_WIDTH x TILE_WIDTH .
Please note that you have no more than 2 memory loads in the stream due to the specific choice of your parameters. For example, if you have TILE_WIDTH = 8 , then the number of threads in the stream block is 64 , and the total memory size is 12x12=144 , which means that each thread is responsible for at least 2 shared memory being written with 144/64=2.25 .
