An active warp is one that is resident on the SM, i.e. all of its resources (registers, etc.) have been allocated and the warp is available for execution, provided it is schedulable. If a warp reaches a barrier before the other warps in the same threadblock / workgroup, it is still "active" (it is still resident on the SM and all of its registers are still valid), but it will not execute any instructions because it is not ready to be scheduled.
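To make the situation concrete, here is a minimal sketch of a kernel with one block-wide barrier. The kernel name `reverse_tiles`, the host helper `run_reverse_tiles` and the block size of 256 are all made up for illustration, and error checking is left out for brevity. Warps that finish their shared-memory store early sit at `__syncthreads()`: they are still resident and keep their registers, but they are not schedulable until the last warp of the block arrives.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // assumed block size: 8 warps of 32 threads

// Reverses each kBlockSize-element tile of `data` in place via shared memory.
__global__ void reverse_tiles(int* data) {
    __shared__ int tile[kBlockSize];
    const int local = threadIdx.x;
    const int global = blockIdx.x * kBlockSize + local;

    tile[local] = data[global];

    // Warps arriving here early stay resident on the SM (still "active"),
    // but issue no instructions until every warp in the block has arrived.
    __syncthreads();

    // Safe only after the barrier: this element was written by another thread.
    data[global] = tile[kBlockSize - 1 - local];
}

// Hypothetical host helper: fills num_blocks tiles with 0, 1, 2, ...,
// runs the kernel and returns the result. Error checks omitted.
std::vector<int> run_reverse_tiles(int num_blocks) {
    const int n = num_blocks * kBlockSize;
    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = i;

    int* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reverse_tiles<<<num_blocks, kBlockSize>>>(dev);
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling `run_reverse_tiles(2)` should return two tiles, each reversed independently of the other. Removing the barrier would make the read of `tile` a race between warps.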
Inserting a barrier not only stalls execution, it also acts as a barrier for the compiler: the compiler is not allowed to perform most optimizations across the barrier, since doing so could defeat the purpose of the barrier. This is the most likely reason you are seeing more instructions: without the barrier, the compiler is free to perform more optimizations.
The cost of a barrier depends heavily on what your code is doing, but each barrier introduces a bubble in which all the warps (effectively) have to go idle before they all start working again. So if you have a very large threadblock / workgroup, the bubble is potentially bigger than with a small block. The impact of the bubble depends on your code: if it is very memory-bound, the barrier will expose memory latencies that could previously be hidden, but if it is more balanced, the effect may be less noticeable.
This means that in a very memory-bound kernel you may be better off launching a larger number of smaller blocks, so that other blocks can execute while one block is stalled on a barrier. You would need to make sure that your occupancy actually increases as a result, and if you are sharing data between threads through block-shared memory, there is a trade-off to be made.
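One way to check whether smaller blocks really give you more resident warps is to ask the CUDA runtime. The sketch below is only an illustration: the kernel `scale_with_barrier` and the helper `resident_warps_per_sm` are hypothetical names, and the kernel is a stand-in for your own. It uses `cudaOccupancyMaxActiveBlocksPerMultiprocessor` to find how many blocks of a given size fit on one SM, then converts that to warps.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical memory-bound kernel with one block-wide barrier.
__global__ void scale_with_barrier(const float* in, float* out, int n) {
    extern __shared__ float tile[];  // one float per thread, sized at launch
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // every thread of the block reaches this barrier
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

// Returns the number of warps that can be resident on one SM for the given
// block size, or -1 if a runtime query fails.
int resident_warps_per_sm(int block_size) {
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    int blocks_per_sm = 0;
    const size_t smem_bytes = block_size * sizeof(float);
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, scale_with_barrier, block_size, smem_bytes)
        != cudaSuccess) {
        return -1;
    }
    const int warps_per_block = (block_size + prop.warpSize - 1) / prop.warpSize;
    return blocks_per_sm * warps_per_block;
}

// Prints resident warps per SM for a few candidate block sizes.
void report_occupancy() {
    const int sizes[] = {128, 256, 512, 1024};
    for (int size : sizes) {
        std::printf("block size %4d -> %d resident warps per SM\n",
                    size, resident_warps_per_sm(size));
    }
}
```

The numbers depend on the device and on the kernel's register and shared-memory use, so treat the result as a starting point and confirm with timing. For the same warp count, more and smaller blocks mean that a barrier in one block stalls a smaller share of the SM's warps.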
Tom