You must declare the shared memory variable as mutable:
__shared__ volatile int data [BLOCK_SIZE];
The problem you see is an Fermi architecture artifact and compiler optimization. The Fermi architecture lacks instructions for working directly with shared memory (they were present in the G80 / 90 / GT200 series). Thus, everything is loaded for registration, management and storage in shared memory. But the compiler can freely deduce that the code can be done faster if a series of operations were put in the register, without intermediate loads and storages from / to shared memory. This is great if you do not rely on implicit synchronization of threads within the same warp managing shared memory, as in this reduction code.
By declaring the shared memory buffer as volatile, you force the compiler to force a write of shared memory after each recovery step and restore implicit data synchronization between threads at the core.
This issue is discussed in the Fermi programming notes, which sends (or possibly sends) using CUDA tools.
talonmies
source share