I am trying to move an algorithm I wrote from a Tesla T10 processor (compute capability 1.3) to a Tesla M2075 (compute capability 2.0). After the switch, I was surprised to find that my algorithm slowed down. I analyzed it, and it looks like the CUDA streams are blocking on the new machine. My algorithm has three main tasks that can be split up and run in parallel: memory reorganization (which can be done on the CPU), memory copying from the host to the device, and kernel execution on the device. On the old machine, splitting the work across streams allowed the three tasks to overlap like this (all screenshots from the NVIDIA Visual Profiler):
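For reference, each stream issues its work roughly like this (a simplified sketch, not my actual code; `reorganizeOnHost`, `myKernel`, and the buffer names are placeholders):

```cuda
// Simplified sketch of one stream's iteration (placeholder names).
// The host buffer is pinned (cudaHostAlloc) so cudaMemcpyAsync can truly overlap.
void issueWork(cudaStream_t stream, float* h_pinned, float* d_buf, int n)
{
    reorganizeOnHost(h_pinned, n);                            // CPU-side memory reorganization

    cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);          // async H2D copy on this stream

    myKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);  // kernel on the same stream

    cudaStreamSynchronize(stream);                            // this is the call that blocks
}
```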
However, on the new machine the streams block before starting the CPU computation until the previous kernel has finished executing, which can be seen here:
You can see that in the top row all of the orange blocks are cudaStreamSynchronize calls, which block until the previous kernel has finished executing, even though that kernel is in a completely different stream. It seems to work for the first pass through the streams and parallelize correctly, but after that the problem starts, so I thought that maybe it was blocking on something and tried increasing the number of streams, which gave me this result:
Here you can see that, for some reason, only the first four streams block, after which it starts to parallelize correctly. As a last attempt, I tried to hack around it by using only the first four streams once and then switching to the later streams, but that still didn't work: it still stalls every four streams while letting the other streams execute concurrently:
So I'm looking for any ideas as to what might be causing this problem and how to diagnose it. I have pored over my code, and I don't think it's a bug there, although I could be wrong. Each stream is encapsulated in its own class and only has a reference to a single cudaStream_t, which is a member of that class, so I don't see how it could be referring to another stream and blocking on it.
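To be concrete, the wrapper class looks roughly like this (a simplified sketch of the ownership pattern; the real class does more):

```cuda
// Simplified sketch of the stream wrapper class (the real one does more).
class StreamWorker {
public:
    StreamWorker()  { cudaStreamCreate(&stream_); }
    ~StreamWorker() { cudaStreamDestroy(stream_); }

    // Non-copyable, so no two workers can ever share the same cudaStream_t.
    StreamWorker(const StreamWorker&) = delete;
    StreamWorker& operator=(const StreamWorker&) = delete;

    cudaStream_t get() const { return stream_; }

private:
    cudaStream_t stream_;  // the only stream this object ever touches
};
```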
Are there any changes in the way streams work between compute capabilities 1.3 and 2.0 that I'm not aware of? Could it be something like shared memory not being freed that it has to wait on? Any ideas on how to diagnose this problem are welcome. Thanks.
Jsoet