The answers above indicate how block size can affect performance and offer a general heuristic for its selection based on maximizing employment. Not wanting to provide a criterion for choosing the block size, it is worth mentioning that CUDA 6.5 (now in Release Candidate version) includes several new run-time functions that help in calculating the busyness and launch configuration, see
CUDA Pro Tip: The Employment API simplifies startup customization
One useful feature is cudaOccupancyMaxPotentialBlockSize , which heuristically calculates the block size that reaches maximum occupancy. The values provided by this function can then be used as a starting point for manually optimizing startup parameters. Below is a small example.
#include <stdio.h> /************************/ /* TEST KERNEL FUNCTION */ /************************/ __global__ void MyKernel(int *a, int *b, int *c, int N) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < N) { c[idx] = a[idx] + b[idx]; } } /********/ /* MAIN */ /********/ void main() { const int N = 1000000; int blockSize; // The launch configurator returned block size int minGridSize; // The minimum grid size needed to achieve the maximum occupancy for a full device launch int gridSize; // The actual grid size needed, based on input size int* h_vec1 = (int*) malloc(N*sizeof(int)); int* h_vec2 = (int*) malloc(N*sizeof(int)); int* h_vec3 = (int*) malloc(N*sizeof(int)); int* h_vec4 = (int*) malloc(N*sizeof(int)); int* d_vec1; cudaMalloc((void**)&d_vec1, N*sizeof(int)); int* d_vec2; cudaMalloc((void**)&d_vec2, N*sizeof(int)); int* d_vec3; cudaMalloc((void**)&d_vec3, N*sizeof(int)); for (int i=0; i<N; i++) { h_vec1[i] = 10; h_vec2[i] = 20; h_vec4[i] = h_vec1[i] + h_vec2[i]; } cudaMemcpy(d_vec1, h_vec1, N*sizeof(int), cudaMemcpyHostToDevice); cudaMemcpy(d_vec2, h_vec2, N*sizeof(int), cudaMemcpyHostToDevice); float time; cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0); cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, N); // Round up according to array size gridSize = (N + blockSize - 1) / blockSize; cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf("Occupancy calculator elapsed time: %3.3f ms \n", time); cudaEventRecord(start, 0); MyKernel<<<gridSize, blockSize>>>(d_vec1, d_vec2, d_vec3, N); cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf("Kernel elapsed time: %3.3f ms \n", time); printf("Blocksize %i\n", blockSize); cudaMemcpy(h_vec3, d_vec3, N*sizeof(int), cudaMemcpyDeviceToHost); for (int i=0; i<N; i++) { if (h_vec3[i] != h_vec4[i]) { printf("Error at i = %i! Host = %i; Device = %i\n", i, h_vec4[i], h_vec3[i]); return; }; } printf("Test passed\n"); }
EDIT
cudaOccupancyMaxPotentialBlockSize defined in the cuda_runtime.h file and is defined as follows:
template<class T> __inline__ __host__ CUDART_DEVICE cudaError_t cudaOccupancyMaxPotentialBlockSize( int *minGridSize, int *blockSize, T func, size_t dynamicSMemSize = 0, int blockSizeLimit = 0) { return cudaOccupancyMaxPotentialBlockSizeVariableSMem(minGridSize, blockSize, func, __cudaOccupancyB2DHelper(dynamicSMemSize), blockSizeLimit); }
The values for the parameters are as follows
minGridSize = Suggested min grid size to achieve a full machine launch. blockSize = Suggested block size to achieve maximum occupancy. func = Kernel function. dynamicSMemSize = Size of dynamically allocated shared memory. Of course, it is known at runtime before any kernel launch. The size of the statically allocated shared memory is not needed as it is inferred by the properties of func. blockSizeLimit = Maximum size for each block. In the case of 1D kernels, it can coincide with the number of input elements.
Please note that with CUDA 6.5 you need to calculate one 2D / 3D block size from the 1D block size proposed by the API.
Note also that the CUDA driver API contains functionally equivalent APIs for calculating occupancy, so cuOccupancyMaxPotentialBlockSize can be used in the API driver code in the same way as the execution API shown in the above example.