CUDA Reduction: What is N?

Question

CUDA Reduction: What is N?

According to NVIDIA, this is the fastest core amount reduction:

template <unsigned int blockSize> __device__ void warpReduce(volatile int *sdata, unsigned int tid) { if (blockSize >= 64) sdata[tid] += sdata[tid + 32]; if (blockSize >= 32) sdata[tid] += sdata[tid + 16]; if (blockSize >= 16) sdata[tid] += sdata[tid + 8]; if (blockSize >= 8) sdata[tid] += sdata[tid + 4]; if (blockSize >= 4) sdata[tid] += sdata[tid + 2]; if (blockSize >= 2) sdata[tid] += sdata[tid + 1]; } template <unsigned int blockSize> __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) { extern __shared__ int sdata[]; unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*(blockSize*2) + tid; unsigned int gridSize = blockSize*2*gridDim.x; sdata[tid] = 0; while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; } __syncthreads(); if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); } if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); } if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); } if (tid < 32) warpReduce(sdata, tid); if (tid == 0) g_odata[blockIdx.x] = sdata[0]; }

However, I do not understand the "n" parameter. Any clues? I do not think that the size of the array will decrease, since a buffer overflow will occur in the while loop.

+7

parallel-processing sum cuda

Michael Eilers Smith Nov 18 '11 at 0:52

source share

1 answer

Jared hoberock · Accepted Answer · 2011-11-18T01:14:17+0000

I believe that you found a typo in the slides (probably it should be something like while(i + blockDim.x < n) ).

If you look at the source code in the CUDA SDK "reduction" example, the body of the last reduce6 looks like this:

 template <class T, unsigned int blockSize, bool nIsPow2> __global__ void reduce6(T *g_idata, T *g_odata, unsigned int n) { T *sdata = SharedMemory<T>(); // perform first level of reduction, // reading from global memory, writing to shared memory ... T mySum = 0; // we reduce multiple elements per thread. The number is determined by the // number of active thread blocks (via gridDim). More blocks will result // in a larger gridSize and therefore fewer elements per thread while (i < n) { mySum += g_idata[i]; // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays if (nIsPow2 || i + blockSize < n) mySum += g_idata[i+blockSize]; i += gridSize; }

Note the explicit check inside while , which prevents access outside of g_idata access. Your initial suspicion is true; n is just the size of the g_idata array.

CUDA Reduction: What is N?

More articles: