GPU synchronization

I have a question about how GPUs synchronize. As I understand it, when a warp encounters a barrier (assuming OpenCL), it knows that other warps of the same work-group have not reached it yet, so it must wait. But what exactly does this warp do while it waits? Is it still an active warp? Or does it execute some kind of no-op?

I have also noticed that when the kernel contains synchronization, the instruction count goes up. I am curious what the source of this increase is. Is the synchronization broken down into several smaller GPU instructions? Or do the idle warps execute some additional instructions?

And finally, I would like to know whether the cost added by synchronization (for example, barrier(CLK_LOCAL_MEM_FENCE)), compared with the same code without synchronization, depends on the number of warps in the work-group (or thread block)? Thanks

1 answer

An active warp is one that is resident on an SM, i.e. all its resources (registers, etc.) have been allocated and the warp is available for execution, so it is eligible for scheduling. If a warp reaches the barrier before the other warps in the same thread block / work-group, it remains active (it still resides on the SM, and all its registers are still valid), but it will not execute any instructions, because it is not ready to be scheduled.
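To make the situation concrete, here is a minimal OpenCL sketch of the pattern the question describes; the kernel name, buffer layout, and the neighbour-sum computation are all illustrative, not taken from the original question:

```c
// Illustrative OpenCL kernel (hypothetical names): each work-item stages a
// value in local memory, and the whole work-group must meet at the barrier
// before any work-item reads a neighbour's slot.
__kernel void neighbour_sum(__global const float *in,
                            __global float *out,
                            __local float *scratch)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    scratch[lid] = in[gid];

    // Warps that arrive here early stay resident on the SM (active, with
    // registers intact) but simply are not scheduled until every warp in
    // the work-group has arrived.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Safe now: scratch[] has been fully written by the whole work-group.
    size_t next = (lid + 1) % get_local_size(0);
    out[gid] = scratch[lid] + scratch[next];
}
```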

Inserting a barrier not only constrains execution, it also acts as a barrier for the compiler: the compiler is not allowed to perform most optimizations across the barrier, since doing so could defeat the purpose of the barrier. This is the most likely reason you see more instructions - without the barrier, the compiler can optimize more aggressively.
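As a sketch of why the compiler must be conservative (again with hypothetical names), consider two reads of the same local-memory slot on either side of a barrier; without the barrier the second load could be folded into the first, but across the barrier it must be re-issued, because another work-item may have written the slot in between:

```c
// Illustrative OpenCL kernel: the barrier forces a real second load.
__kernel void read_twice(__global float *out, __local float *s)
{
    size_t lid = get_local_id(0);

    float a = s[lid];              // first load
    barrier(CLK_LOCAL_MEM_FENCE);
    float b = s[lid];              // cannot be replaced by 'a': another
                                   // work-item may have updated s[lid]
                                   // before the barrier completed
    out[get_global_id(0)] = a + b;
}
```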

The cost of the barrier depends very much on what your code is doing, but each barrier introduces a bubble in which all warps must stand idle before they can all start working again, so if you have a very large thread block / work-group, there is of course potentially a larger bubble than with a small block. The impact of the bubble depends on your code - if your code is heavily memory-bound, then the barrier will expose memory latencies that could previously be hidden, but if it is more balanced, the effect may be less noticeable.

This means that in a heavily memory-bound kernel, you may be better off launching a larger number of smaller blocks, so that other blocks can execute while one block is stalled at the barrier. You would need to check that your occupancy actually increases as a result, and if you are sharing data between threads via the block's shared/local memory, then there is a trade-off to weigh.

