Branch divergence, CUDA and Kinetic Monte Carlo

So, I have a code that uses Kinetic Monte Carlo (KMC) on a lattice to simulate something. I run this code on my GPU with CUDA (although I believe the same question applies to OpenCL as well).

This means that I divide my lattice into small sublattices, and each thread runs on one of them. Since I am doing KMC, each thread has this code:

    while (condition == true) {
        // Grab a sample u from U[0,1]
        for (i = 0; i < 100; i++) {
            // Do some stuff here to generate A
            if (A > u) {
                // Do more stuff here, which could include updates to global memory
                break;
            }
        }
    }
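For concreteness, here is a minimal sketch of what I mean as a CUDA kernel. The names (kmcKernel, rates, numEvents, the lattice update) are made up for illustration; they are not my actual code:

    #include <curand_kernel.h>

    // Illustrative sketch only: 'rates', 'numEvents' and the lattice
    // update are placeholders, not the real simulation.
    __global__ void kmcKernel(curandState *rngStates, const float *rates,
                              int numEvents, int *lattice, int maxSteps)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        curandState rng = rngStates[tid];

        for (int step = 0; step < maxSteps; step++) {   // the while(condition)
            float u = curand_uniform(&rng);             // sample u from U(0,1]
            float A = 0.0f;
            for (int i = 0; i < numEvents; i++) {
                A += rates[tid * numEvents + i];        // "generate A"
                if (A > u) {
                    lattice[tid] = i;                   // update global memory
                    break;
                }
            }
        }
        rngStates[tid] = rng;                           // save RNG state
    }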

A is different for different threads, and so is u. The 100 is just an arbitrary number; in the real code it can be 1000 or even 10000.

So, won't we have branch divergence when the control flow reaches this if? How much can this affect performance? I know that the answer depends on the code inside the if statement, but how will this penalty scale as I add more and more threads?

Any reference on how I can evaluate the performance loss/gain is also welcome.

Thanks!

1 answer

The GPU runs threads in groups of 32, called warps. Divergence can only occur within a warp. So, if you can arrange your threads so that the if condition evaluates the same way throughout the warp, there is no divergence.
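As an illustrative sketch (not from the original question, and assuming a block size that is a multiple of 32): in the first branch below, even and odd lanes of the same warp take different paths, so the warp diverges; in the second, the condition is uniform across each warp, so there is no divergence:

    __global__ void divergenceExamples(float *out)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;   // lane index within the warp
        int warp = threadIdx.x / 32;   // warp index within the block

        // Divergent: even and odd lanes of the SAME warp take
        // different paths, so the warp executes both in turn.
        if (lane % 2 == 0)
            out[tid] = 1.0f;
        else
            out[tid] = 2.0f;

        // Not divergent: the condition is uniform across each warp,
        // so all 32 threads of a warp take the same path.
        if (warp % 2 == 0)
            out[tid] += 10.0f;
        else
            out[tid] += 20.0f;
    }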

When there is divergence in an if, conceptually the GPU simply ignores the results and memory requests from the threads in which the if condition was false.

So, let's say that the if evaluates to true for 10 threads in a certain warp. Inside the if, the potential computational performance of the warp drops from 100% to 10/32 * 100 = 31.25%, since the 22 threads that were disabled by the if could be doing work but now just take up space in the warp.
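If you want to observe this active fraction at run time, CUDA 9 and later provide warp vote intrinsics. A minimal sketch; call it while the warp is still converged, passing in the branch condition:

    __device__ float activeFraction(bool pred)
    {
        // One bit per lane whose predicate is true (CUDA 9+ intrinsic).
        unsigned mask = __ballot_sync(0xffffffffu, pred);
        // Count the set bits: active lanes out of 32.
        return __popc(mask) / 32.0f;  // e.g. 10 active lanes -> 0.3125
    }

    // Usage, before the divergent branch:
    //     float frac = activeFraction(A > u);
    //     if (A > u) { ... }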

As soon as you exit the if, the disabled threads are re-enabled, and the warp runs again at 100% of its potential computational performance.

An if-else behaves in much the same way. When the warp reaches the else, the threads that were enabled in the if become disabled, and those that were disabled become enabled.

In a for loop that iterates a different number of times for each thread in the warp, threads are disabled as their iteration counts reach their respective limits, but the warp as a whole must keep going until the thread with the highest iteration count has finished.
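As a back-of-the-envelope model of this effect (my own sketch, not part of the original answer): the warp pays for the maximum trip count, while only the sum of the per-thread trip counts is useful work:

    #include <stdio.h>

    /* Toy model: loop efficiency of one warp given each thread's
       iteration count. The warp executes max(n[i]) iterations; only
       sum(n[i]) of the 32*max lane-iterations do useful work. */
    float warpLoopEfficiency(const int n[32])
    {
        int sum = 0, max = 0;
        for (int i = 0; i < 32; i++) {
            sum += n[i];
            if (n[i] > max) max = n[i];
        }
        return (float)sum / (32.0f * max);
    }

    int main(void)
    {
        int n[32];
        for (int i = 0; i < 32; i++) n[i] = 1 + i;  /* trip counts 1..32 */
        printf("efficiency = %.1f%%\n", 100.0f * warpLoopEfficiency(n));
        /* (1+2+...+32) / (32*32) = 528/1024 = 51.6% */
        return 0;
    }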

Regarding potential memory bandwidth, the situation is a bit more complicated. If the algorithm is memory bound, there may be little or no performance loss from warp divergence, since the number of memory transactions can be reduced. If each thread in the warp reads from a completely different location in global memory (a bad situation for a GPU), time is saved for each of the disabled threads, since their memory transactions do not have to be performed. On the other hand, if the threads read from an array laid out for coalesced GPU access, multiple threads share the results of a single transaction. In that case, the values intended for the disabled threads are read from memory anyway and then discarded, along with the computations those disabled threads could have done.
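Here is a sketch of my own contrasting the two access patterns described above (the array names are placeholders):

    __global__ void memoryPatterns(const float *in, float *out,
                                   const int *idx)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // Coalesced: consecutive lanes read consecutive addresses, so
        // the warp's 32 reads are served by a few wide transactions.
        // Values fetched for disabled lanes are simply discarded, so
        // disabling lanes saves nothing here.
        float a = in[tid];

        // Scattered: each lane reads an arbitrary address, so each
        // active lane costs its own transaction. Here, lanes disabled
        // by divergence actually remove memory transactions.
        float b = in[idx[tid]];

        out[tid] = a + b;
    }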

So, now you probably have enough of an overview to make fairly sound judgments about how much warp divergence will affect your performance. The worst case is when only one thread in a warp is active: then you get 1/32 = 3.125% of the potential compute performance. The best case is 31/32 = 96.875%. For an if that is completely random, you get 50% on average. And, as mentioned, memory performance depends on the change in the number of required memory transactions. (To measure this on real code, the CUDA profiler can report it directly; nvprof, for example, has branch_efficiency and warp_execution_efficiency metrics.)
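You can sanity-check the 50% figure for a random predicate with a quick host-side simulation (an illustration of mine, not from the answer):

    #include <stdio.h>
    #include <stdlib.h>

    /* Monte Carlo check of the "random if gives ~50%" claim: draw a
       random predicate per lane and average the active fraction. */
    int main(void)
    {
        srand(42);
        const int trials = 100000;
        double total = 0.0;
        for (int t = 0; t < trials; t++) {
            int active = 0;
            for (int lane = 0; lane < 32; lane++)
                active += rand() & 1;      /* predicate true w.p. 1/2 */
            total += active / 32.0;
        }
        printf("average active fraction: %.3f\n", total / trials);
        /* prints ~0.500 */
        return 0;
    }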

