How to determine why a CUDA stream is blocking

I am trying to move an algorithm I wrote from a Tesla T10 processor (compute capability 1.3) to a Tesla M2075 (compute capability 2.0). While making the switch I was surprised to find that my algorithm slowed down. I analyzed it and found that the CUDA streams appear to be blocking on the new machine. My algorithm has three main tasks that can be split up and run in parallel: memory reorganization (which can be done on the CPU), a memory copy from the host to the device, and the kernel execution on the device. On the old machine, splitting the work across streams allowed the three tasks to overlap like this (all screenshots are from the NVidia Visual Profiler): Correct stream execution
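
Roughly, each stream runs a pipeline like the one sketched below (a simplified illustration with placeholder names, sizes, and kernel, not the actual code):

    // Simplified sketch of the per-stream pipeline described above.
    // Names, sizes, and the kernel body are placeholders, not the real code.
    #include <cuda_runtime.h>

    __global__ void processChunk(float *d_data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_data[i] *= 2.0f;              // stand-in for the real kernel work
    }

    void runChunk(const float *h_src, float *h_pinned, float *d_data, int n,
                  cudaStream_t stream)
    {
        // Task 1: memory reorganization on the CPU into a pinned staging buffer
        // (pinned memory, e.g. from cudaHostAlloc, is needed for the async copy to overlap).
        for (int i = 0; i < n; ++i)
            h_pinned[i] = h_src[i];                // stand-in for the real reshuffle

        // Task 2: asynchronous host-to-device copy in this chunk's stream.
        cudaMemcpyAsync(d_data, h_pinned, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);

        // Task 3: kernel launch in the same stream; the call returns immediately,
        // so the next chunk's CPU work can start while this one runs on the device.
        processChunk<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    }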

However, on the new machine the streams block before the CPU computation starts, until the previous kernel has finished executing, which can be seen here: 3 stream execution

You can see in the top row that all of the orange blocks are cudaStreamSynchronize calls, which block until the previous kernel has finished executing, even though that kernel is in a completely different stream. It seems to work for the first pass through the streams and parallelize correctly, but after that the problem starts, so I thought that maybe something was blocking and tried increasing the number of streams, which gave me this result: 12 stream execution

Here you can see that for some reason only the first 4 streams block, after which it starts to parallelize correctly. As a last attempt, I tried to hack around it by using only the first 4 streams for a single run each and then switching to the later streams, but that still didn't work: it still stalled every 4 streams while letting the other streams execute concurrently: 10 stream execution

So, I'm looking for any ideas as to what might be causing this problem and how to diagnose it. I have pored over my code and I don't think it's a bug there, although I could be mistaken. Each stream is encapsulated in its own class and only holds a reference to a single cudaStream_t, which is a member of that class, so I don't see how one stream could be referring to another stream and blocking on it.
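
For reference, the wrapper is structured roughly like this (a stripped-down sketch with hypothetical names; the real class holds more state than just the stream):

    // Stripped-down sketch of the per-stream wrapper class described above;
    // names are hypothetical and most of the real state is omitted.
    #include <cuda_runtime.h>

    class StreamWorker {
    public:
        StreamWorker()  { cudaStreamCreate(&stream_); }
        ~StreamWorker() { cudaStreamDestroy(stream_); }

        // All async copies and kernel launches made by this object use stream_;
        // the orange blocks in the profiler correspond to this call.
        void wait() { cudaStreamSynchronize(stream_); }

        cudaStream_t stream() const { return stream_; }

    private:
        cudaStream_t stream_;   // the only stream this object ever references
    };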

Are there any changes to the way streams work between compute capability 1.3 and 2.0 that I'm not aware of? Could it be something like shared memory not being freed, so a stream has to wait for it? Any ideas on how to diagnose this problem are welcome, thanks.

1 answer

I cannot be absolutely sure without seeing code, but it looks like you may be having an issue with the order in which you queue up your commands. There is a slight difference in the way compute capability 1.x and 2.x devices handle streams, because 2.x devices can run multiple kernels at the same time and handle both an HtoD and a DtoH copy simultaneously.

If you queue your commands in the order all HtoDs, all computes, all DtoHs, you will have good results on the Tesla cards (1060, etc.).

If you order them copy HtoD, compute, copy DtoH, copy HtoD, ... etc., you will have good results on Fermi.
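
In code, the two issue orders look roughly like this (the kernel, buffers, and sizes are made up for illustration):

    // Illustrative sketch of the two command-issue orders described above.
    // The kernel, buffers, and sizes are hypothetical.
    #include <cuda_runtime.h>

    __global__ void compute(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;                  // stand-in for real work
    }

    const int NSTREAMS = 4;

    // First ordering: all HtoD copies, then all kernels, then all DtoH copies.
    void issueBatched(float *h_in[], float *d_buf[], float *h_out[],
                      int n, cudaStream_t streams[])
    {
        for (int i = 0; i < NSTREAMS; ++i)
            cudaMemcpyAsync(d_buf[i], h_in[i], n * sizeof(float),
                            cudaMemcpyHostToDevice, streams[i]);
        for (int i = 0; i < NSTREAMS; ++i)
            compute<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_buf[i], n);
        for (int i = 0; i < NSTREAMS; ++i)
            cudaMemcpyAsync(h_out[i], d_buf[i], n * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[i]);
    }

    // Second ordering: issue the full HtoD / kernel / DtoH chain per stream.
    void issuePerStream(float *h_in[], float *d_buf[], float *h_out[],
                        int n, cudaStream_t streams[])
    {
        for (int i = 0; i < NSTREAMS; ++i) {
            cudaMemcpyAsync(d_buf[i], h_in[i], n * sizeof(float),
                            cudaMemcpyHostToDevice, streams[i]);
            compute<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_buf[i], n);
            cudaMemcpyAsync(h_out[i], d_buf[i], n * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[i]);
        }
    }

In either case the host buffers need to be pinned (cudaHostAlloc / cudaMallocHost) for the async copies to overlap at all.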

Kepler does well in both cases. Issue order does matter for streams in both the Tesla and Fermi cases; I suggest reading this post from NVIDIA for more information. Overlapping across streams can be an extremely tricky problem, and I wish you good luck. If you want more help, a general picture of the order in which you queue up your operations would be extremely helpful.

