CUDA dummy / implicit block synchronization

I know that inter-block synchronization is not possible; the only way is to launch a new kernel.

BUT, suppose I launch X blocks, where X is the number of SMs on my GPU. I should expect the scheduler to assign one block to each SM ... right? And if the GPU is a secondary graphics card (fully dedicated to CUDA), this means that, in theory, no other process uses it ... right?

My idea is this: implicit synchronization.

Suppose that sometimes I need only one block, and sometimes I need all X blocks. Well, in the cases where I need only one block, I can structure my code so that the first block (or the first SM) works on the "real" data, while the other X-1 blocks (or SMs) execute exactly the same instructions on some "dummy" data, just with a different offset.

That way they all stay in step until I need them all again.
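
To make the idea concrete, here is a rough sketch of what I mean (the kernel body, the 256-thread block size and the realBlocksNeeded parameter are just placeholders, not actual code I have):

    #include <cuda_runtime.h>

    __global__ void mykernel ( int realBlocksNeeded ) {
        if ( blockIdx.x < realBlocksNeeded ) {
            /* Work on the "real" data. */
        } else {
            /* Execute exactly the same instructions, but on "dummy" data. */
        }
    }

    int main ( void ) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop , 0 );
        int X = prop.multiProcessorCount;   /* one block per SM, as described above */
        mykernel<<< X , 256 >>>( 1 );       /* only the first block works on real data */
        cudaDeviceSynchronize();
        return 0;
    }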

Is the scheduler reliable under these conditions? Or is there no way to be sure?

+4
2 answers

You have several questions in one, so I will try to address them separately.

One block per SM

I asked about this a while back on nVidia's own forums, since I was getting results that indicated that this is not what happens. Apparently, the block scheduler will not assign one block per SM even if the number of blocks equals the number of SMs.

Implicit Sync

No. First of all, you cannot guarantee that each block will get its own SM (see above). Secondly, not all blocks can access global memory at the same time. Even if they start out executing in lockstep, they will lose that synchrony after the first global memory read/write.

Block synchronization

Now for the good news: yes, you can. You can use the atomic functions described in Section B.11 of the CUDA C Programming Guide to build a barrier. Suppose that your GPU runs N blocks concurrently.

    __device__ int barrier = N;

    __global__ void mykernel ( ) {

        /* Do whatever it is that this block does. */
        ...

        /* Make sure all threads in this block are actually here. */
        __syncthreads();

        /* Once we're done, decrease the value of the barrier. */
        if ( threadIdx.x == 0 )
            atomicSub( &barrier , 1 );

        /* Now wait for the barrier to be zero. */
        if ( threadIdx.x == 0 )
            while ( atomicCAS( &barrier , 0 , 0 ) != 0 );

        /* Make sure everybody has waited for the barrier. */
        __syncthreads();

        /* Carry on with whatever else you wanted to do. */
        ...

    }

The call atomicSub(p,i) atomically computes *p -= i and is issued only by thread zero of each block, i.e. we only want to decrement the barrier once per block. The call atomicCAS(p,c,v) sets *p = v iff *p == c, and returns the old value of *p. That part simply spins until the barrier reaches 0, i.e. until all the blocks have crossed it.

Note that you have to wrap this part in calls to __syncthreads(), since the threads in a block do not execute in strict lockstep and you have to force them all to wait for thread zero.

Just remember that if you call your kernel more than once, you must reset the barrier back to N.
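
For example, you could reset it from the host before each launch with something along these lines (a sketch only; numberOfBlocks and threadsPerBlock stand for whatever values you actually use):

    int N = numberOfBlocks;
    /* Reset the device-side barrier counter before launching. */
    cudaMemcpyToSymbol( barrier , &N , sizeof( int ) );
    mykernel<<< N , threadsPerBlock >>>();
    cudaDeviceSynchronize();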

Update

In response to jHackTheRipper's answer and Cicada's comment, I should have pointed out that you should not try to launch more blocks than can be scheduled concurrently on the GPU! This is limited by a number of factors, and you should use the CUDA Occupancy Calculator to find the maximum number of blocks for your kernel and device.

Judging by the original question, however, only as many blocks as there are SMs are being launched, so this point is moot.
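
If you prefer to check this at runtime instead of with the spreadsheet, newer toolkits (CUDA 6.5 and later, which may be newer than what this question assumed) expose an occupancy API. A sketch, using the mykernel from above and a 256-thread block as an example:

    int blocksPerSM = 0 , device = 0;
    cudaDeviceProp prop;
    cudaGetDevice( &device );
    cudaGetDeviceProperties( &prop , device );
    /* Maximum number of resident blocks of mykernel per SM for a 256-thread block. */
    cudaOccupancyMaxActiveBlocksPerMultiprocessor( &blocksPerSM , mykernel , 256 , 0 );
    int maxResidentBlocks = blocksPerSM * prop.multiProcessorCount;
    /* Never launch more than maxResidentBlocks blocks with the barrier scheme. */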

+3

@Pedro is definitely wrong!

Achieving global synchronization has been the subject of several recent research studies, at least for pre-Kepler architectures (I do not have a Kepler card yet). The conclusion is always the same (or should be): it is impossible to achieve such global synchronization across the entire GPU.

The reason is simple: CUDA blocks cannot be preempted, so if you fully occupy the GPU, threads waiting at the rendez-vous barrier will never allow their block to terminate. Thus, the block will never be removed from its SM, and it will prevent the remaining blocks from ever starting.

As a result, you simply freeze the GPU, which can never escape this deadlock state.
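
To make the failure mode concrete, here is a sketch (maxResidentBlocks stands for however many blocks of the barrier kernel can actually be resident at once; mykernel is the barrier kernel from the other answer):

    /* One block more than can be resident: the resident blocks spin at the
       barrier and are never preempted, so the extra block never gets to run
       and never decrements the counter. The launch never completes. */
    mykernel<<< maxResidentBlocks + 1 , threadsPerBlock >>>();
    cudaDeviceSynchronize();   /* hangs forever */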

- edit to answer Pedro's comments -

Such flaws were noticed by other authors, such as: http://www.openclblog.com/2011/04/eureka.html

(from the author of OpenCL in Action)

- edit to answer Pedro's minor comments -

The same conclusion is reached by @Jared Hoberock in this SO post: Inter-block barrier on CUDA
-4
