Control flow divergence in CUDA

Let's say I have 3 shared memory arrays: a, b and c. I'm not sure whether the following thread arrangement will cause control flow divergence:

    if (threadIdx < 64) {
        if (threadIdx == 1)
            for (int i = 0; i < N; i++)
                c += a[threadIdx]*a[threadIdx];
        else
            for (int i = 0; i < N; i++)
                c += a[threadIdx]*b[threadIdx];
    }

If so, how badly will it affect performance? Is there an effective way to solve the problem? Thanks.

2 answers

If there are multiple warps of threads in a block, I would expect divergence in exactly one warp of each block (the warp that contains thread 1).

But the only difference between your two loops is which memory they access, not which instructions they execute. So I would do this instead:

    if (threadIdx.x < 64) {
        // this conditional might diverge
        if (threadIdx.x == 1)
            ptr = a;
        else
            ptr = b;
        // but obviously this part will not
        for (int i = 0; i < N; i++)
            c += a[threadIdx.x]*ptr[threadIdx.x];
    }

Depending on the size of your block, the first condition threadIdx.x < 64 (note the .x) may not cause any divergence at all. For example, if you have a block with dimensions (128,1,1), then the first two warps (groups of 32 threads that execute in lockstep) will enter the if block, while the last two will skip it. Since each warp goes one way or the other as a whole, there is no divergence.
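One way to check this reasoning is to ask each warp whether all of its lanes agree on a predicate. The sketch below is only an illustration, not part of the original answer: the names probe_divergence and run_probe are hypothetical, it assumes a (128,1,1) block, a warp size of 32, and a CUDA version that provides __ballot_sync (9.0 or later). For comparison it also probes a predicate that is true for a single thread.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical probe kernel: for each warp, record whether a predicate
// evaluates differently across the warp's 32 lanes.
__global__ void probe_divergence(int *lt64_diverges, int *eq1_diverges)
{
    const unsigned full_mask = 0xffffffffu;   // assumes all 32 lanes are active
    const int warp = threadIdx.x / warpSize;  // warp index within the block
    const int lane = threadIdx.x % warpSize;  // lane index within the warp

    // Each bit of a ballot holds one lane's predicate value.
    const unsigned lt64 = __ballot_sync(full_mask, threadIdx.x < 64);
    const unsigned eq1  = __ballot_sync(full_mask, threadIdx.x == 1);

    if (lane == 0) {
        // A predicate is warp-uniform when the ballot is all zeros or all ones.
        lt64_diverges[warp] = (lt64 != 0u && lt64 != full_mask);
        eq1_diverges[warp]  = (eq1  != 0u && eq1  != full_mask);
    }
}

// Launches one block of 128 threads (4 warps) and returns 8 flags:
// the first 4 are for `threadIdx.x < 64`, the last 4 for `threadIdx.x == 1`.
std::vector<int> run_probe()
{
    const int warps = 4;
    std::vector<int> host(2 * warps, -1);
    int *dev = nullptr;

    cudaError_t err = cudaMalloc(&dev, 2 * warps * sizeof(int));
    assert(err == cudaSuccess);

    probe_divergence<<<1, warps * 32>>>(dev, dev + warps);

    err = cudaMemcpy(host.data(), dev, 2 * warps * sizeof(int),
                     cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;  // silence unused-variable warnings when NDEBUG is defined

    cudaFree(dev);
    return host;
}
```

Under these assumptions, the flags for threadIdx.x < 64 come back 0 for all four warps, while the single-thread predicate is flagged only for warp 0, the warp that contains thread 1.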

A condition like threadIdx.x == 1 will cause divergence, but it will have a very modest cost. Indeed, in many cases CUDA will be able to implement the conditional expression with a single instruction. For example, operations such as min, max and abs will generally be implemented with a single instruction and cause no divergence at all. You can read about such instructions in the PTX Manual.
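As a rough illustration of a conditional that is small enough for the compiler to handle without a branch, the inner choice from the question can be written as a conditional expression. This is a sketch under assumptions: the arrays are taken to be float, c is treated as one output slot per thread (the question leaves this open), and the names accumulate_select and run_select are hypothetical. Whether the compiler actually emits a select or predicated instruction here is up to the compiler and is not guaranteed.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 1 accumulates a*a, every other thread below 64
// accumulates a*b. The operand is picked once with a conditional expression.
__global__ void accumulate_select(const float *a, const float *b,
                                  float *c, int n_iter)
{
    const int t = threadIdx.x;
    if (t < 64) {  // warp-uniform for a (128,1,1) block
        // Short conditional; a compiler may turn this into a select
        // instead of a branch (assumption, not guaranteed).
        const float rhs = (t == 1) ? a[t] : b[t];

        float acc = 0.0f;
        for (int i = 0; i < n_iter; ++i)
            acc += a[t] * rhs;  // identical instruction stream for all lanes
        c[t] = acc;
    }
}

// Runs the kernel with a[t] = t + 1, b[t] = 2 and n_iter = 4,
// and returns the 64 accumulated values.
std::vector<float> run_select()
{
    const int n = 64, n_iter = 4;
    std::vector<float> a(n), b(n, 2.0f), c(n, 0.0f);
    for (int t = 0; t < n; ++t) a[t] = static_cast<float>(t + 1);

    float *da = nullptr, *db = nullptr, *dc = nullptr;
    const size_t bytes = n * sizeof(float);
    cudaError_t err = cudaMalloc(&da, bytes);
    assert(err == cudaSuccess);
    err = cudaMalloc(&db, bytes);
    assert(err == cudaSuccess);
    err = cudaMalloc(&dc, bytes);
    assert(err == cudaSuccess);

    cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice);

    accumulate_select<<<1, 128>>>(da, db, dc, n_iter);

    err = cudaMemcpy(c.data(), dc, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;  // silence unused-variable warnings when NDEBUG is defined

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return c;
}
```

With these inputs thread 1 accumulates 4 * (2 * 2) and every other thread t below 64 accumulates 4 * ((t + 1) * 2), so the results can be checked exactly on the host.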

In general, you should not worry too much about modest amounts of control flow divergence like the cases above. Where divergence will bite you is in situations like

    if (threadIdx.x % 4 == 0)
        // do expensive operation
    else if (threadIdx.x % 4 == 1)
        // do expensive operation
    else if (threadIdx.x % 4 == 2)
        // do expensive operation
    else
        // do expensive operation

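where the "expensive operation" is one that requires tens or hundreds of instructions. Here every warp contains threads from all four branches, so it has to execute the four expensive paths one after another with only a quarter of its threads active each time. The divergence caused by the if statements therefore reduces efficiency by 75%.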
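To make the 75% figure concrete, the sketch below counts how many of the four branches are taken inside each warp. It is an illustration only: paths_in_warp, count_paths and run_count_paths are hypothetical names, and it assumes a (128,1,1) block, a warp size of 32 and __ballot_sync (CUDA 9.0 or later). For contrast it also counts paths for a selector keyed on the warp index, (threadIdx.x / warpSize) % 4, which is an option only if the algorithm lets you choose which threads do which work.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Counts how many of the selector values 0..3 occur within the calling warp.
// Every lane runs the same uniform loop, so the ballots themselves do not diverge.
__device__ int paths_in_warp(int selector)
{
    int paths = 0;
    for (int k = 0; k < 4; ++k)
        paths += (__ballot_sync(0xffffffffu, selector == k) != 0u) ? 1 : 0;
    return paths;
}

// Hypothetical probe kernel: compares a lane-keyed and a warp-keyed selector.
__global__ void count_paths(int *by_lane, int *by_warp)
{
    const int warp = threadIdx.x / warpSize;

    const int lane_keyed = paths_in_warp(threadIdx.x % 4);  // pattern from the text
    const int warp_keyed = paths_in_warp(warp % 4);         // one branch per warp

    if (threadIdx.x % warpSize == 0) {
        by_lane[warp] = lane_keyed;
        by_warp[warp] = warp_keyed;
    }
}

// Launches one block of 128 threads (4 warps) and returns 8 counts:
// the first 4 for the lane-keyed selector, the last 4 for the warp-keyed one.
std::vector<int> run_count_paths()
{
    const int warps = 4;
    std::vector<int> host(2 * warps, -1);
    int *dev = nullptr;

    cudaError_t err = cudaMalloc(&dev, 2 * warps * sizeof(int));
    assert(err == cudaSuccess);

    count_paths<<<1, warps * 32>>>(dev, dev + warps);

    err = cudaMemcpy(host.data(), dev, 2 * warps * sizeof(int),
                     cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;  // silence unused-variable warnings when NDEBUG is defined

    cudaFree(dev);
    return host;
}
```

Every warp reports 4 paths for the lane-keyed selector, so if the paths cost the same, each warp does roughly four times the work of a single path. The warp-keyed selector reports 1 path per warp.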

Keep in mind that divergence between threads is a much smaller concern than (1) the choice of high-level algorithm and (2) memory locality and coalescing. Very few CUDA programmers should ever need to worry about the kind of divergence in your examples.

