The Visual Profiler is a great tool for checking this kind of work. Once you have functionally correct code, run it from the Visual Profiler. On Linux, for example, assuming you have an X session, just start nvvp from a terminal window. You will be presented with a wizard that asks for the application to profile along with any command-line parameters.
The profiler will then do a basic run of your application to collect statistics. You can also select more detailed statistical collection (requiring additional runs), and one of the options is memory utilization statistics. It reports memory utilization as a percentage of peak, and it will also flag what it considers to be low utilization that deserves your attention.
If your utilization number is above 50%, your application is probably working as you expect. If the number is small, you have probably missed some detail of coalescing. The profiler reports statistics separately for memory reads and memory writes. To get to 100% or close to it, you will also need to make sure that your coalesced reads and writes from a warp are aligned on 128-byte boundaries.
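As an illustration (this is my own minimal sketch, not code from the question), here is a kernel whose per-warp accesses are both contiguous and 128-byte aligned. Since cudaMalloc returns pointers aligned to at least 256 bytes, each warp's 32 x 4 bytes = 128 bytes start on a 128-byte boundary, and this pattern should report load/store efficiency at or near 100%:

    // Minimal sketch: fully coalesced, aligned copy. Names and sizes are illustrative.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];   // thread k of a warp touches element k: contiguous
    }

    // launch sketch: copy_coalesced<<<(n + 255) / 256, 256>>>(d_in, d_out, n);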
A common mistake in these situations is to use the threadIdx.y variable as the rapidly varying index. I don't think you've made this mistake. For example, shared[threadIdx.x][threadIdx.y] is a frequent error, because it is often the way we think about it in C. But threads are grouped together first along the x axis, so we want to use shared[threadIdx.y][threadIdx.x] or something like that. If you make this mistake, your code will still be functionally correct, but you will get low percentage utilization numbers in the profiler, for example around 12% or even 3%.
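To make the ordering concrete, here is a hedged sketch (array names and dimensions are my own, not from the question) of the same principle applied to a global 2D access, which is what the profiler's efficiency numbers measure:

    // Coalesced: threadIdx.x drives the column (fast) index, so the 32 threads
    // of a warp read 32 adjacent floats. 'data', 'out', 'width', 'height' are
    // assumed names for illustration.
    __global__ void read_rowmajor(const float *data, float *out, int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            out[row * width + col] = data[row * width + col];

        // Uncoalesced variant: swapping the roles, e.g.
        //   int col = blockIdx.y * blockDim.y + threadIdx.y;
        //   int row = blockIdx.x * blockDim.x + threadIdx.x;
        // makes adjacent threads of a warp touch elements 'width' floats apart.
    }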
As already mentioned, to get above 50% and approach 100%, you will want to make sure that not only are all your per-warp access requests adjacent, but that they are also aligned on a 128-byte boundary. Because of the L1/L2 caches, these are not hard-and-fast rules but guidelines; the caches can mitigate some mistakes to some extent.
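For example, an access that is contiguous but shifted off a 128-byte boundary (the offset here is my own illustration) makes each warp straddle two 128-byte segments, so efficiency drops below 100% without falling to the very low figures caused by an indexing mistake, and the caches absorb part of the extra traffic:

    // Contiguous but misaligned when offset != 0: each warp's 128 bytes span
    // two 128-byte segments, requiring two transactions instead of one.
    // Assumes 'in' holds at least n + offset elements.
    __global__ void copy_offset(const float *in, float *out, int n, int offset)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i + offset];
    }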
Robert Crovella