How to call __device__ function in CUDA with fewer threads

I want to call the exclusive scan function from inside the kernel, which performs sorting by base. But an exclusive check requires only half of the threads to do its job.

The exclusive scan algorithm requires several __syncthreads (). If I have an instruction at the beginning, for example

if (threadIdx.x> NTHREADS / 2) return;

these threads will not participate in exclusive syncthreads checks, which is prohibited. Is there any way around this problem. I have an exclusive scan call surrounded by __syncthread () s.

+4
source share
2 answers

Something like this should work (do not use early return):

__syncthreads(); // at entry to exclusive scan region // begin exclusive scan function if (threadIdx.x < NTHREADS/2) { // do first phase of exclusive scan up to first syncthreads } __syncthreads(); // first syncthreads in exclusive scan function if (threadIdx.x < NTHREADS/2) { // do second phase of exclusive scan up to second syncthreads } __syncthreads(); // second syncthreads in exclusive scan function (... etc.) __syncthreads(); // at exit from exclusive scan region 

This is somewhat tedious, but this is the only way I know to stick to the letter of the __syncthreads() usage law. You can also try just leaving the code as you indicated, with threads that do not do work, early return / exit. It may just work, it may work. But there is no guarantee that it will work for future architecture or a new tool chain.

+4
source

Just to point to an alternative:
You can also use the inline __syncthreads() build equivalent, which allows you to use an optional argument for the number of participating threads, available with 2.0 compute capability. Something like this should work:

 #define __syncthreads_active(active_threads) asm volatile("bar.sync 0, %0;" :: "r"(active_threads)); if(threadIdx.x >= NTHREADS/2) return; int active_warps = (NTHREADS/2 + warpSize) / warpSize; int active_threads = active_warps * warpSize; // hopefully the compiler will optimize this to a simple active_threads = (NTHREADS/2 + warpSize) & ~32 __syncthreads_active(active_threads); // do some work... __syncthreads_active(active_threads); // do some more work... __syncthreads_active(active_threads); 

DISCLAIMER: written in a browser and not fully verified!

Is it worth the worry, this is another question.

+2
source

All Articles