Just to point out an alternative:
You can also use the inline PTX equivalent of __syncthreads(), which accepts an optional argument for the number of participating threads and is available from compute capability 2.0. The thread count passed to bar.sync must be a multiple of the warp size. Something like this should work:
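    #define __syncthreads_active(active_threads) asm volatile("bar.sync 0, %0;" :: "r"(active_threads));

    // Threads in the upper half of the block exit and never reach the barrier.
    if(threadIdx.x >= NTHREADS/2) return;

    // Round the number of active threads up to a whole number of warps.
    int active_warps = (NTHREADS/2 + warpSize - 1) / warpSize;
    int active_threads = active_warps * warpSize;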
DISCLAIMER: written in a browser and not fully verified!
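For context, here is a minimal sketch of how the pieces above might fit into a complete program. It is an illustration under assumptions, not tested code: the kernel name `half_block_reverse`, the helper `round_up_to_warp`, the wrapper `syncthreads_active`, and the block size of 256 are all hypothetical, and the warp size is hard-coded to 32 so the rounding can also be evaluated on the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NTHREADS 256  // hypothetical block size for this sketch

// Round n up to the next multiple of the warp size (assumed to be 32 here).
__host__ __device__ constexpr int round_up_to_warp(int n, int warp = 32) {
    return (n + warp - 1) / warp * warp;
}

// Barrier 0 over `active_threads` threads instead of the whole block.
__device__ __forceinline__ void syncthreads_active(int active_threads) {
    asm volatile("bar.sync 0, %0;" :: "r"(active_threads) : "memory");
}

__global__ void half_block_reverse(int *out) {
    __shared__ int buf[NTHREADS / 2];

    // The upper half of the block exits early and never reaches the barrier.
    if (threadIdx.x >= NTHREADS / 2) return;

    buf[threadIdx.x] = threadIdx.x;

    // Only the remaining threads (rounded up to whole warps) take part.
    syncthreads_active(round_up_to_warp(NTHREADS / 2));

    // Read a value written by a different thread, which needs the barrier.
    out[threadIdx.x] = buf[NTHREADS / 2 - 1 - threadIdx.x];
}

int main() {
    int h_out[NTHREADS / 2];
    int *d_out = nullptr;

    cudaMalloc(&d_out, sizeof(h_out));
    half_block_reverse<<<1, NTHREADS>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    printf("first=%d last=%d\n", h_out[0], h_out[NTHREADS / 2 - 1]);

    cudaFree(d_out);
    return 0;
}
```

The rounding matters when NTHREADS/2 is not a multiple of the warp size: the warp straddling the cutoff still arrives at the barrier as a whole warp, so the count has to include it.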
Whether it is worth the trouble is another question.