Depending on the dimensions of your block, the first condition threadIdx.x < 64 (note the .x) may not cause any divergence at all. For example, if you have a block with dimensions (128,1,1), then the first two warps (groups of 32 threads that execute in lock-step) will enter the if block, while the last two will bypass it. Since whole warps go one way or the other, there is no divergence.
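To check this for a particular block size, one option is to let each warp vote on the condition. The sketch below is only an illustration: the names flagDivergentWarps and countDivergentWarps are made up for this example, and it assumes a one-dimensional block whose size is a multiple of 32 (and at most 1024). A warp is flagged when some, but not all, of its lanes satisfy threadIdx.x < cutoff.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each warp votes on the predicate (threadIdx.x < cutoff)
// and records whether its lanes disagree. One flag is written per warp.
__global__ void flagDivergentWarps(int cutoff, int *warpDiverges)
{
    const unsigned fullMask = 0xffffffffu;
    int pred = threadIdx.x < cutoff;

    // Warp-wide votes: does any lane satisfy the predicate? Do all of them?
    int any = __any_sync(fullMask, pred);
    int all = __all_sync(fullMask, pred);

    // Lane 0 of each warp stores the result for its warp.
    if (threadIdx.x % warpSize == 0)
        warpDiverges[threadIdx.x / warpSize] = (any && !all) ? 1 : 0;
}

// Hypothetical host helper: launches one block of (blockSize,1,1) threads and
// returns how many warps were flagged, or -1 on bad input or a CUDA error.
// Assumes blockSize is a multiple of 32 and no larger than 1024.
int countDivergentWarps(int blockSize, int cutoff)
{
    if (blockSize <= 0 || blockSize > 1024 || blockSize % 32 != 0) return -1;

    int numWarps = blockSize / 32;
    int *dFlags = nullptr;
    if (cudaMalloc(&dFlags, numWarps * sizeof(int)) != cudaSuccess) return -1;

    flagDivergentWarps<<<1, blockSize>>>(cutoff, dFlags);

    int hFlags[32] = {0};
    cudaError_t err = cudaMemcpy(hFlags, dFlags, numWarps * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dFlags);
    if (err != cudaSuccess) return -1;

    int count = 0;
    for (int w = 0; w < numWarps; ++w) count += hFlags[w];
    return count;
}

int main()
{
    // Block of (128,1,1): a cutoff on a warp boundary versus one inside a warp.
    printf("cutoff 64: %d warp(s) with mixed lanes\n", countDivergentWarps(128, 64));
    printf("cutoff 48: %d warp(s) with mixed lanes\n", countDivergentWarps(128, 48));
    return 0;
}
```

With 128 threads and a cutoff of 64, the boundary falls exactly between warps 1 and 2, so no warp is flagged. A cutoff of 48 splits warp 1 and flags that one warp.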
A condition like threadIdx.x == 1 will cause divergence, but its cost will be very modest. Indeed, in many cases CUDA can implement a conditional expression with a single instruction. For example, operations such as min, max and abs will generally be implemented with a single instruction and cause no divergence at all. You can read about such instructions in the PTX Manual.
In general, you should not worry too much about modest amounts of control-flow divergence like the above. Where divergence will bite you is in situations like
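As a small illustration of the min/max point, here is a sketch that clamps values to a range in two ways: with an explicit if/else and with fmaxf/fminf. The kernel and helper names are hypothetical. Whether the compiler emits a real branch for the first version is not guaranteed either way; it may well turn it into select or predicated instructions by itself, so treat this as a way to inspect the generated code rather than as a performance claim.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Written with explicit branches per element.
__global__ void clampBranchy(const float *in, float *out, int n, float lo, float hi)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    if (v < lo)      v = lo;
    else if (v > hi) v = hi;
    out[i] = v;
}

// Same result for ordinary (non-NaN) inputs, expressed with fmaxf/fminf,
// which typically map to the PTX max/min instructions.
__global__ void clampBranchFree(const float *in, float *out, int n, float lo, float hi)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = fminf(fmaxf(in[i], lo), hi);
}

// Hypothetical host helper: runs one of the two kernels on n host values.
// Returns false if any CUDA call fails.
bool clampOnDevice(const float *hIn, float *hOut, int n, float lo, float hi,
                   bool branchFree)
{
    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    if (branchFree) clampBranchFree<<<blocks, threads>>>(dIn, dOut, n, lo, hi);
    else            clampBranchy<<<blocks, threads>>>(dIn, dOut, n, lo, hi);

    cudaError_t err = cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}

int main()
{
    const int n = 5;
    float in[n] = {-2.0f, -0.5f, 0.25f, 0.75f, 3.0f};
    float a[n], b[n];
    if (!clampOnDevice(in, a, n, 0.0f, 1.0f, false) ||
        !clampOnDevice(in, b, n, 0.0f, 1.0f, true)) {
        printf("CUDA error\n");
        return 1;
    }
    for (int i = 0; i < n; ++i)
        printf("%g -> branchy %g, branch-free %g\n", in[i], a[i], b[i]);
    return 0;
}
```

Compiling with nvcc -ptx and comparing the two kernels is a reasonable way to see what the compiler actually generated for each.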
    if (threadIdx.x % 4 == 0)
        // do expensive operation
    else if (threadIdx.x % 4 == 1)
        // do expensive operation
    else if (threadIdx.x % 4 == 2)
        // do expensive operation
    else
        // do expensive operation
where the "expensive operation" would be one requiring tens or hundreds of instructions. In this case, the divergence caused by the if would reduce efficiency by 75%, because each warp has to run the four branches one after another with only a quarter of its threads active in each.
Keep in mind that thread divergence is a much lesser concern than (1) the choice of high-level algorithm and (2) memory locality/coalescing. Very few CUDA programmers should ever need to worry about the sort of divergence in your examples.
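The 75% figure comes from counting active lanes: within one warp, only the threads whose threadIdx.x % 4 matches a given branch do useful work while that branch runs. The sketch below (the names countLanesPerBranch and lanesPerBranch are made up for this example) uses a warp ballot to count them for a single warp of 32 threads. It assumes the four branches really are serialized, which is the expected outcome for long branch bodies but not something the code itself can verify.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, meant to be launched with exactly one warp (32 threads).
// For each of the four branches, count how many lanes would take it.
__global__ void countLanesPerBranch(int *counts)
{
    const unsigned fullMask = 0xffffffffu;
    for (int k = 0; k < 4; ++k) {
        // Bit i of the ballot is set if lane i satisfies the predicate.
        unsigned ballot = __ballot_sync(fullMask, threadIdx.x % 4 == k);
        if (threadIdx.x == 0)
            counts[k] = __popc(ballot);  // number of set bits
    }
}

// Hypothetical host helper: fills counts[0..3]; returns false on a CUDA error.
bool lanesPerBranch(int counts[4])
{
    int *dCounts = nullptr;
    if (cudaMalloc(&dCounts, 4 * sizeof(int)) != cudaSuccess) return false;
    countLanesPerBranch<<<1, 32>>>(dCounts);
    cudaError_t err = cudaMemcpy(counts, dCounts, 4 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dCounts);
    return err == cudaSuccess;
}

int main()
{
    int counts[4];
    if (!lanesPerBranch(counts)) {
        printf("CUDA error\n");
        return 1;
    }
    for (int k = 0; k < 4; ++k)
        printf("branch %d: %d of 32 lanes active\n", k, counts[k]);
    return 0;
}
```

Each branch gets 8 of the 32 lanes, so while any one branch is executing, 24 lanes sit idle. That is 25% utilization, or the 75% loss mentioned above.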