In general, you want your block and grid dimensions to match your data while maximizing occupancy, that is, how many threads are active simultaneously. The main factors affecting occupancy are shared memory usage, register usage, and thread block size.
A CUDA-enabled GPU has its processing capability split up into SMs (streaming multiprocessors), and the number of SMs depends on the actual card, but here we'll focus on a single SM for simplicity (they all behave the same). Each SM has a finite number of 32-bit registers, a finite amount of shared memory, a maximum number of active blocks, and a maximum number of active threads. These numbers depend on the CC (compute capability) of your GPU and can be found in the middle of the Wikipedia article http://en.wikipedia.org/wiki/CUDA .
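If you'd rather not dig through that table, the runtime API can report most of these limits directly. A minimal sketch (the per-SM register and shared memory fields assume a reasonably recent CUDA toolkit):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("SM count:                %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("32-bit registers per SM: %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```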
First of all, your thread block size should always be a multiple of 32, because the hardware issues instructions in warps (groups of 32 threads). For example, if you have a block size of 50 threads, the GPU will still issue instructions to 64 threads and you'd just be wasting the other 14.
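As a sketch of the usual pattern: pick a warp-multiple block size up front, then round the grid size up so the whole data set is covered (`myKernel` and the doubling it does are just placeholders):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: doubles each of n elements.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // guard: the last block may be partly idle
}

void launch(float *d_data, int n) {
    int threadsPerBlock = 256;   // always a multiple of the 32-thread warp
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up to cover n
    myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
}
```

Because the grid is rounded up, the kernel needs that `if (i < n)` guard so the idle threads of the last block don't touch out-of-range data.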
Second, before worrying about shared memory and registers, try to size your blocks based on the maximum numbers of threads and blocks that correspond to your card's compute capability. Sometimes there are multiple ways to do this... for example, a CC 3.0 card can have 16 active blocks and 2048 active threads per SM. This means that with 128 threads per block, you could fit 16 blocks on an SM before hitting the 2048-thread limit. With 256 threads per block you can only fit 8, but you're still using all of the available threads and will still have full occupancy. However, with 64 threads per block only 1024 threads are in use once you hit the 16-block limit, so only 50% occupancy. If shared memory and register usage are not a bottleneck, this should be your main concern (other than your data dimensions).
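To make that arithmetic concrete, here's a tiny host-side sketch that reproduces the 64/128/256 comparison (the two limits are the CC 3.0 numbers from above; registers and shared memory are ignored):

```cuda
#include <algorithm>
#include <cstdio>

int main() {
    const int maxBlocksPerSM  = 16;    // CC 3.0 block limit per SM
    const int maxThreadsPerSM = 2048;  // CC 3.0 thread limit per SM

    for (int blockSize = 64; blockSize <= 256; blockSize *= 2) {
        int blocksByThreads = maxThreadsPerSM / blockSize;             // capped by the thread limit
        int activeBlocks = std::min(blocksByThreads, maxBlocksPerSM);  // capped by the block limit
        int activeThreads = activeBlocks * blockSize;
        printf("%3d threads/block -> %2d blocks, %4d threads, %3.0f%% occupancy\n",
               blockSize, activeBlocks, activeThreads,
               100.0 * activeThreads / maxThreadsPerSM);
    }
    return 0;
}
```

This prints 50% occupancy for 64-thread blocks and 100% for both 128 and 256, matching the reasoning above.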
On the topic of your grid... the blocks in your grid are spread out over the SMs to start, and the remaining blocks are placed into a pipeline. Blocks are moved onto an SM for processing as soon as that SM has enough resources to take the block, so as blocks complete on an SM, new ones are moved in. You could make the argument that smaller blocks (128 instead of 256 in the previous example) may complete faster, since a particularly slow block will hog fewer resources, but this depends very much on the code.
Regarding registers and shared memory, look at these next, as they may be limiting your occupancy. Shared memory is finite for a whole SM, so try to use it in an amount that allows as many blocks as possible to still fit on an SM at once. The same goes for register use. Again, these limits depend on compute capability and are tabulated on the Wikipedia page.
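As a closing sketch, the runtime can tell you both what a kernel actually uses and how many blocks would fit per SM as a result. `myKernel` is again a placeholder, and the occupancy call assumes CUDA 6.5 or newer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, just so we have something to inspect.
__global__ void myKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 2.0f;
}

int main() {
    // How many registers and how much static shared memory does the kernel use?
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers/thread: %d, static shared mem: %zu bytes\n",
           attr.numRegs, attr.sharedSizeBytes);

    // Let the runtime work out how many blocks of 256 threads fit on one SM,
    // taking registers and shared memory into account.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
    printf("active blocks per SM at 256 threads/block: %d\n", blocksPerSM);
    return 0;
}
```

Good luck!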