The GPU runs threads in groups of 32, called warps. Whenever different threads in a warp take different paths through the code, the GPU must run the whole warp several times, once for each code path.
To deal with this problem, known as warp divergence, arrange your threads so that the threads in any given warp take as few different code paths as possible. Once you have done that, you pretty much just have to bite the bullet and accept the performance loss caused by any remaining divergence. In some cases there may be nothing you can do to arrange your threads this way. If so, and if the divergent code paths are a large part of your kernel or overall workload, the task may not be a good fit for the GPU.
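As a rough illustration of arranging threads by warp, here is a minimal sketch with two kernels. The names `path_a`, `path_b`, `interleaved` and `warp_aligned` are made up for this example, and it assumes a 1D launch whose block size is a multiple of 32. In the first kernel, neighbouring threads take different branches, so every warp has to go through both paths. In the second, the branch condition is the same for all 32 threads of a warp, so each warp goes through only one.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for two different code paths.
__device__ float path_a(float x) { return x * 2.0f + 1.0f; }
__device__ float path_b(float x) { return x * x - 1.0f; }

// Divergent: even and odd threads of the same warp take different
// branches, so each warp runs both path_a and path_b.
__global__ void interleaved(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) {
        out[i] = path_a(in[i]);
    } else {
        out[i] = path_b(in[i]);
    }
}

// Warp-aligned: the condition depends only on the warp index, so all
// threads of a warp agree on the branch (assumes blockDim.x % 32 == 0).
__global__ void warp_aligned(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warp = i / warpSize;
    if (warp % 2 == 0) {
        out[i] = path_a(in[i]);
    } else {
        out[i] = path_b(in[i]);
    }
}
```

Note that the two kernels do not assign the same path to the same element. In real code you would typically reorder or sort the input so that elements needing the same path sit next to each other, and then the warp-aligned form becomes possible.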
It doesn't matter how you implement the different code paths: if-else, switch, predication (in PTX or SASS), branch tables or something else. If threads in a warp end up running different paths, you take a performance hit.
It also doesn't matter how many threads take each path, only the total number of different paths within the warp.
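If you want to check how many different paths a warp would take for a given input, something like the following sketch could help. It is only a diagnostic idea, not part of any standard tooling: the kernel name `count_paths` and the arrays `path_of` and `paths_per_warp` are hypothetical, and it assumes a 1D launch with a block size that is a multiple of 32 on a GPU with compute capability 7.0 or higher (needed for `__match_any_sync`). A warp where one thread takes path 1 and 31 take path 0 reports 2, the same as a warp split 16/16.

```cuda
#include <cuda_runtime.h>

// path_of[i] is an integer label for the branch thread i would take
// (hypothetical input). paths_per_warp[w] receives the number of
// distinct labels seen in warp w.
__global__ void count_paths(const int* path_of, int* paths_per_warp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // All 32 lanes vote here, so mask names exactly the in-range lanes.
    unsigned mask = __ballot_sync(0xffffffffu, i < n);
    if (i >= n) return;

    // Bitmask of the lanes in this warp that hold the same label.
    unsigned same = __match_any_sync(mask, path_of[i]);

    // The lowest lane of each group acts as that group's leader.
    int lane = threadIdx.x % warpSize;
    bool leader = (__ffs(same) - 1) == lane;

    // One leader per distinct label, so counting leaders counts paths.
    int count = __popc(__ballot_sync(mask, leader));
    if (lane == 0) paths_per_warp[i / warpSize] = count;
}
```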
Below is another answer, which goes into more detail.