You have several questions in one, so I will try to address them separately.
One block per SM
I asked about this some time ago on nVidia's own forums, because I was getting results indicating that this is not what happens. Apparently, the block scheduler is not guaranteed to assign one block to each SM, even when the number of blocks is equal to the number of SMs.
Implicit synchronization
No. First of all, you cannot guarantee that each block will have its own SM (see above). Secondly, the blocks cannot all access global memory at the same time. If they run synchronously at all, they will lose that synchronization at the first memory read or write.
Block synchronization
Now for the good news: yes, you can. You can use the atomic instructions described in Section B.11 of the CUDA C Programming Guide to create a barrier. Suppose that N blocks are executing concurrently on your GPU.
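    __device__ int barrier = N;

    __global__ void mykernel ( ) {

        ...

        /* Make sure every thread in the block has reached this point. */
        __syncthreads();

        /* Thread zero signs the block in, then waits for the other blocks. */
        if ( threadIdx.x == 0 ) {
            atomicSub( &barrier , 1 );
            while ( atomicCAS( &barrier , 0 , 0 ) != 0 );
        }

        /* Hold the rest of the block until thread zero is through. */
        __syncthreads();

        ...

    }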
The atomicSub(p,i) instruction atomically computes *p -= i and is called only by the zeroth thread in the block, i.e. we only want to decrement barrier once per block. The atomicCAS(p,c,v) instruction sets *p = v iff *p == c and returns the old value of *p. This part simply spins until barrier reaches 0, i.e. until all the blocks have reached it.
Note that you have to wrap this part in calls to __syncthreads(), since the threads in a block do not execute in strict lock-step, and you must make them all wait for the zeroth thread.
Just remember that if you call your kernel more than once, you must set barrier back to N before each launch.
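As a rough illustration of the reset, here is a minimal sketch of a host program that sets barrier back to N with cudaMemcpyToSymbol before every launch. The kernel name barrier_demo, the helper launch_and_check, the arrived and seen buffers, and the values of N and THREADS are all made up for this example. The __threadfence() calls are my addition and are not part of the barrier described above. The sketch assumes that all N blocks really do run at the same time; if they do not, it will hang.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4         // assumed block count; must not exceed what the GPU runs concurrently
#define THREADS 64  // assumed block size

__device__ int barrier = N;

// Hypothetical demo kernel: every block registers in `arrived`, waits at the
// barrier, then records how many registrations it can see.
__global__ void barrier_demo(int *arrived, int *seen)
{
    if (threadIdx.x == 0) atomicAdd(arrived, 1);

    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();                           // addition: publish earlier writes
        atomicSub(&barrier, 1);                    // this block has arrived
        while (atomicCAS(&barrier, 0, 0) != 0) {}  // spin until every block has arrived
        __threadfence();                           // addition: order later reads
    }
    __syncthreads();

    if (threadIdx.x == 0) seen[blockIdx.x] = atomicAdd(arrived, 0);
}

// Resets the barrier and launches once. Returns 1 if every block saw all N
// arrivals after the barrier, 0 if some block did not, -1 on a CUDA error.
int launch_and_check()
{
    int n = N;
    int host_seen[N];
    int *arrived = nullptr;
    int *seen = nullptr;

    if (cudaMalloc(&arrived, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&seen, N * sizeof(int)) != cudaSuccess) {
        cudaFree(arrived);
        return -1;
    }
    cudaMemset(arrived, 0, sizeof(int));
    cudaMemcpyToSymbol(barrier, &n, sizeof(int));  // set barrier back to N

    barrier_demo<<<N, THREADS>>>(arrived, seen);

    int result = 1;
    if (cudaMemcpy(host_seen, seen, sizeof(host_seen),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        result = -1;
    } else {
        for (int i = 0; i < N; ++i) {
            if (host_seen[i] != N) result = 0;
        }
    }

    cudaFree(arrived);
    cudaFree(seen);
    return result;
}

int main()
{
    printf("first launch:  %d\n", launch_and_check());
    printf("second launch: %d\n", launch_and_check());
    return 0;
}
```

Without the cudaMemcpyToSymbol call, the second launch would start with barrier at 0, decrement it below zero, and spin forever.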
Update
In response to jHackTheRipper's answer and Cicada's comment, I should have pointed out that you should not try to launch more blocks than can be scheduled simultaneously on the GPU! This is limited by a number of factors, and you should use the CUDA Occupancy Calculator to find the maximum number of blocks for your kernel and device.
Judging by the original question, however, only as many blocks as there are SMs are being launched, so this point is moot.
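If you would rather check the block limit from code than from the Occupancy Calculator, the CUDA runtime has an occupancy query that can give a similar estimate. This is a minimal sketch, not the calculator itself: the helper max_coresident_blocks and the trivial mykernel body are placeholders for this example, and the result is only an upper bound that assumes nothing else is running on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute the kernel you actually want to launch.
__global__ void mykernel(int *data)
{
    if (data) data[blockIdx.x * blockDim.x + threadIdx.x] = 0;
}

// Hypothetical helper: estimates how many blocks of mykernel can be resident
// on the current device at once, or returns -1 on a CUDA error.
int max_coresident_blocks(int threadsPerBlock, size_t dynamicSmemBytes)
{
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    // Resident blocks per SM for this kernel, block size and shared memory use.
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, mykernel, threadsPerBlock, dynamicSmemBytes) != cudaSuccess) {
        return -1;
    }

    return blocksPerSM * prop.multiProcessorCount;
}

int main()
{
    int limit = max_coresident_blocks(128, 0);
    printf("upper bound on concurrently resident blocks: %d\n", limit);
    return 0;
}
```

The N used for the barrier should stay at or below this number, otherwise some blocks may never be scheduled and the blocks that are waiting will spin forever.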