CUDA: how do the first 32 threads execute in parallel?

I know that "each warp contains threads of consecutive, increasing thread IDs, with the first warp containing thread 0", so the first 32 threads should be in the first warp. I also know that all threads in one warp execute simultaneously on the same streaming multiprocessor.

As I understand it, this means there should be no need to synchronize threads when only one warp is doing the work. But the code below gives the wrong answer if I remove any of the __syncthreads() calls in the penultimate if block. I tried to find the reason but got nowhere. I really hope you can tell me what is wrong with this code. Why can't I keep only the last __syncthreads() and still get the correct answer?

    #define BLOCK_SIZE 128

    __global__ void reduce ( int * inData, int * outData )
    {
        __shared__ int data [BLOCK_SIZE];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;

        data [tid] = inData [i] + inData [i + blockDim.x / 2];
        __syncthreads ();

        for ( int s = blockDim.x / 4; s > 32; s >>= 1 )
        {
            if ( tid < s )
                data [tid] += data [tid + s];
            __syncthreads ();
        }

        if ( tid < 32 )
        {
            data [tid] += data [tid + 32]; __syncthreads ();
            data [tid] += data [tid + 16]; __syncthreads ();
            data [tid] += data [tid + 8];  __syncthreads ();
            data [tid] += data [tid + 4];  __syncthreads ();
            data [tid] += data [tid + 2];  __syncthreads ();
            data [tid] += data [tid + 1];  __syncthreads ();
        }

        if ( tid == 0 )
            outData [blockIdx.x] = data [0];
    }

    void main()
    {
        ...
        reduce<<<dim3(128), dim3(128)>>>(dev_data, dev_res);
        ...
    }

PS I am using a GTX 560 Ti.

1 answer

You must declare the shared memory variable as volatile:

    __shared__ volatile int data [BLOCK_SIZE];

The problem you are seeing is an artifact of the Fermi architecture and a compiler optimization. The Fermi architecture lacks instructions that operate directly on shared memory (they were present in the G80/G90/GT200 series), so every shared-memory operation is compiled as a load to a register, the register operation, and a store back to shared memory. But the compiler is free to conclude that the code will run faster if a series of such operations is kept in registers, without the intermediate loads and stores from/to shared memory. That is fine unless you rely on the implicit synchronization of threads within the same warp while manipulating shared memory, as this reduction code does.

By declaring the shared memory buffer volatile, you force the compiler to actually write the shared memory result after each stage of the reduction, which restores the implicit data synchronization between the threads of the warp.
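As a sketch of the same idea, assuming the kernel from the question: a common variant is to take a volatile pointer only for the unrolled last-warp tail, which also makes clear that the __syncthreads() calls between the steps were never what kept the values consistent.

```cuda
// Warp-synchronous tail of the reduction (Fermi-era idiom, sketch only).
// The volatile view forces each += to really load from and store to
// shared memory, so the lock-step execution of the warp keeps the
// partial sums consistent without any __syncthreads() in between.
if ( tid < 32 )
{
    volatile int *vdata = data;   // volatile alias of the shared buffer
    vdata [tid] += vdata [tid + 32];
    vdata [tid] += vdata [tid + 16];
    vdata [tid] += vdata [tid + 8];
    vdata [tid] += vdata [tid + 4];
    vdata [tid] += vdata [tid + 2];
    vdata [tid] += vdata [tid + 1];
}
```

Note that this implicit warp-synchronous style is specific to older architectures; on current GPUs the independent thread scheduling of Volta and later means you should use explicit __syncwarp() or warp shuffle intrinsics instead.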

This issue is discussed in the Fermi compatibility notes that ship (or used to ship) with the CUDA toolkit.

