How to choose grid and block size for CUDA cores?

Question

How to choose grid and block size for CUDA cores?

This is a question about how to determine the CUDA size, block and thread sizes. This is an additional question posted here:

stack overflow

Following this link, the response from talonmies contains a piece of code (see below). I don’t understand the meaning of the comment, which is usually chosen by the setting and hardware limitations.

I did not find a good explanation or explanation explaining this in the CUDA documentation. So my question is how to determine the optimal block size (= number of threads) given the following code:

const int n = 128 * 1024; int blocksize = 512; // value usually chosen by tuning and hardware constraints int nblocks = n / nthreads; // value determine by block size and total work madd<<<nblocks,blocksize>>>mAdd(A,B,C,n);

By the way, I started my question with the link above, because it partially answers my first question. If this is not the right way to ask questions about stack overflows, please excuse me or let me know.

+71

performance optimization gpu cuda nvidia

user1292251 Apr 03 '12 at 1:14

source share

3 answers

The answers above indicate how block size can affect performance and offer a general heuristic for its selection based on maximizing employment. Not wanting to provide a criterion for choosing the block size, it is worth mentioning that CUDA 6.5 (now in Release Candidate version) includes several new run-time functions that help in calculating the busyness and launch configuration, see

CUDA Pro Tip: The Employment API simplifies startup customization

One useful feature is cudaOccupancyMaxPotentialBlockSize , which heuristically calculates the block size that reaches maximum occupancy. The values provided by this function can then be used as a starting point for manually optimizing startup parameters. Below is a small example.

 #include <stdio.h> /************************/ /* TEST KERNEL FUNCTION */ /************************/ __global__ void MyKernel(int *a, int *b, int *c, int N) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < N) { c[idx] = a[idx] + b[idx]; } } /********/ /* MAIN */ /********/ void main() { const int N = 1000000; int blockSize; // The launch configurator returned block size int minGridSize; // The minimum grid size needed to achieve the maximum occupancy for a full device launch int gridSize; // The actual grid size needed, based on input size int* h_vec1 = (int*) malloc(N*sizeof(int)); int* h_vec2 = (int*) malloc(N*sizeof(int)); int* h_vec3 = (int*) malloc(N*sizeof(int)); int* h_vec4 = (int*) malloc(N*sizeof(int)); int* d_vec1; cudaMalloc((void**)&d_vec1, N*sizeof(int)); int* d_vec2; cudaMalloc((void**)&d_vec2, N*sizeof(int)); int* d_vec3; cudaMalloc((void**)&d_vec3, N*sizeof(int)); for (int i=0; i<N; i++) { h_vec1[i] = 10; h_vec2[i] = 20; h_vec4[i] = h_vec1[i] + h_vec2[i]; } cudaMemcpy(d_vec1, h_vec1, N*sizeof(int), cudaMemcpyHostToDevice); cudaMemcpy(d_vec2, h_vec2, N*sizeof(int), cudaMemcpyHostToDevice); float time; cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0); cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, N); // Round up according to array size gridSize = (N + blockSize - 1) / blockSize; cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf("Occupancy calculator elapsed time: %3.3f ms \n", time); cudaEventRecord(start, 0); MyKernel<<<gridSize, blockSize>>>(d_vec1, d_vec2, d_vec3, N); cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf("Kernel elapsed time: %3.3f ms \n", time); printf("Blocksize %i\n", blockSize); cudaMemcpy(h_vec3, d_vec3, N*sizeof(int), cudaMemcpyDeviceToHost); for (int i=0; i<N; i++) { if (h_vec3[i] != h_vec4[i]) { printf("Error at i = %i! Host = %i; Device = %i\n", i, h_vec4[i], h_vec3[i]); return; }; } printf("Test passed\n"); }

EDIT

cudaOccupancyMaxPotentialBlockSize defined in the cuda_runtime.h file and is defined as follows:

 template<class T> __inline__ __host__ CUDART_DEVICE cudaError_t cudaOccupancyMaxPotentialBlockSize( int *minGridSize, int *blockSize, T func, size_t dynamicSMemSize = 0, int blockSizeLimit = 0) { return cudaOccupancyMaxPotentialBlockSizeVariableSMem(minGridSize, blockSize, func, __cudaOccupancyB2DHelper(dynamicSMemSize), blockSizeLimit); }

The values for the parameters are as follows

 minGridSize = Suggested min grid size to achieve a full machine launch. blockSize = Suggested block size to achieve maximum occupancy. func = Kernel function. dynamicSMemSize = Size of dynamically allocated shared memory. Of course, it is known at runtime before any kernel launch. The size of the statically allocated shared memory is not needed as it is inferred by the properties of func. blockSizeLimit = Maximum size for each block. In the case of 1D kernels, it can coincide with the number of input elements.

Please note that with CUDA 6.5 you need to calculate one 2D / 3D block size from the 1D block size proposed by the API.

Note also that the CUDA driver API contains functionally equivalent APIs for calculating occupancy, so cuOccupancyMaxPotentialBlockSize can be used in the API driver code in the same way as the execution API shown in the above example.

+27

JackOLantern Jul 29 '14 at 7:52

source share

The block size is usually chosen to maximize "busyness." Search CUDA Occupancy for more information. In particular, see the CUDA Electronic Costing Chart.

+10

Roger Dahl Apr 03 '12 at 1:40

source share

talonmies · Accepted Answer · 2012-04-03 03:34

There are two parts to the answer (I wrote). One part is easy to quantify, the other is more empirical.

Hardware limitations:

It is easy to measure part. Appendix F of the current CUDA programming guide contains a number of stringent restrictions that limit the number of threads per block that can be started by the kernel. If you exceed any of them, your kernel will never start. They can be summarized as follows:

Each block cannot have more than 512/1024 threads ( Compute Capability 1.x or 2.x-3.x, respectively)
The maximum sizes of each block are limited to [512,512,64] / [1024,1024,64] (calculate 1.x / 2.x)
Each block cannot consume more than 8k / 16k / 32k registers (Compute 1.0.1.1/1.2.1.3/2.x)
Each block cannot consume more than 16 KB / 48 KB of shared memory (Compute 1.x / 2.x)

If you stay within these limits, any kernel that you can successfully compile will start without errors.

Performance tuning:

This is the empirical part. The number of threads for each block that you select within the framework of the hardware limitations described above can and does affect the performance of the code executed on the hardware. How each code behaves will be different, and the only real way to quantify it is through thorough benchmarking and profiling. But then again, very roughly generalized:

The number of threads in the block should be round, a multiple of the size of the base, which is 32 on all current equipment.
Each streaming multiprocessor unit on the GPU must have enough active digging to sufficiently hide all the various memory and pipeline operating time in the architecture and achieve maximum throughput. The orthodox approach here is to try to achieve optimal equipment loading (which is Roger Dahl's answer ).

The second point is a huge topic, in which I doubt that someone will try to cover it in one StackOverflow answer. There are people who write PhD dissertations on the quantitative analysis of aspects of the problem (see this presentation by Vasily Volkov from UC Berkley and this article by Henry Wong from the University of Toronto for examples of how complicated this question is).

At the initial level, you should basically know that the size of the block that you choose (within the range of legal block sizes defined by the restrictions above) can and can affect the speed of your code, but it depends on the hardware that you have and on the code you are using. By benchmarking, you will probably find that most non-trivial codes have a “sweet spot” in 128-512 threads per block range, but this will require analysis on your part to find where it is. The good news is that since you work in multiple warp sizes, the search space is very limited, and the best configuration for a given piece of code is relatively easy to find.

How to choose grid and block size for CUDA cores?

Hardware limitations:

Performance tuning:

More articles: