When using the CUDA profiler nvvp, there are several "overhead" metrics related to instructions, for example (a small kernel that triggers them is sketched after the list):
- Branch divergence overhead;
- Shared / global memory replay overhead; and
- Local / global cache replay overhead.
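
For concreteness, here is a minimal, hypothetical kernel (my own illustration, not from any profiler documentation) that should light up two of these counters: the odd/even branch diverges within each warp, and the strided shared-memory access produces a 32-way bank conflict, which the profiler charges as replay overhead. The kernel name `overhead_demo` and the launch configuration are made up for the example.

```cuda
#include <cuda_runtime.h>

// Hypothetical demo kernel: threads in a warp take different branches
// (divergence) and access shared memory with a 32-word stride
// (bank conflicts); both show up as "overhead" in nvvp.
__global__ void overhead_demo(float *out)
{
    __shared__ float buf[32 * 32];

    int tid  = threadIdx.x;
    int lane = tid % 32;   // lane within the warp
    int warp = tid / 32;   // warp index within the block

    // Branch divergence: within one warp, even and odd lanes follow
    // different paths, so the warp executes both paths serially
    // with the inactive lanes masked off.
    float v;
    if (lane % 2 == 0)
        v = tid * 2.0f;
    else
        v = tid * 0.5f;

    // Bank conflict: buf[lane * 32 + warp] puts all 32 lanes of a warp
    // into the same 4-byte bank (index % 32 == warp), so the access is
    // replayed up to 32 times ("shared memory replay overhead").
    buf[lane * 32 + warp] = v;
    __syncthreads();

    out[blockIdx.x * blockDim.x + tid] = buf[lane * 32 + warp];
}

int main()
{
    const int blocks = 2, threads = 128;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));
    overhead_demo<<<blocks, threads>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Profiling this under nvvp should show nonzero branch divergence and shared memory replay overhead, which is what the questions below are about.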
My questions:
- What causes these overheads?
- How are they calculated?
- Similarly, how are the global load / store efficiency metrics calculated?
Addendum: I found all the formulas for these overheads in the "CUDA Profiler User Guide" packaged with the CUDA 5 toolkit.
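
For anyone landing here later: from memory, the derived metrics in that guide are defined roughly as follows (compute capability 2.x event names; this is my recollection, not a quote, so verify the exact event names against the guide itself):

```latex
% Approximate metric definitions, recalled from the CUDA 5 profiler guide
% (compute capability 2.x hardware event names; treat as approximate):
\begin{align*}
\text{shared memory replay overhead} &=
    \frac{\mathit{l1\_shared\_bank\_conflict}}{\mathit{instructions\_issued}}\\[4pt]
\text{global cache replay overhead} &=
    \frac{\mathit{l1\_global\_load\_miss}}{\mathit{instructions\_issued}}\\[4pt]
\text{global load efficiency} &=
    100 \times \frac{\mathit{gld\_requested\_throughput}}{\mathit{gld\_throughput}}
\end{align*}
```

The common pattern is that a replay overhead divides the number of replayed (serialized) transactions by the total instructions issued, while an efficiency metric compares the throughput the kernel actually requested against what the hardware had to move.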