What is the meaning of the shared/global memory replay overhead in the nvvp CUDA profiler? How is it calculated?

When we use the CUDA profiler nvvp, there are several "overheads" correlated with instructions, for example:

  • Branch divergence overhead;
  • Shared/global memory replay overhead; and
  • Local/global cache replay overhead.

My questions:

  • What causes these overheads? And
  • How are they calculated?
  • Similarly, how are the global load/store efficiencies calculated?

Addendum: I found all the formulas that calculate these overheads in the "CUDA Profiler User's Guide" packaged with the CUDA 5 toolkit.

1 answer

You can find the answers to your question here:

Why is the CUDA profiler indicating replayed instructions: 82% != global replay + local replay + shared replay?

Instructions replayed (%). This gives the percentage of instructions replayed during kernel execution. Replayed instructions are the difference between the number of instructions actually issued by the hardware and the number of instructions that are to be executed by the kernel. Ideally this should be zero. This is calculated as 100 * (instructions issued - instructions executed) / instructions issued
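As a quick worked illustration (the numbers here are made up, not from any real profile): if the hardware issues 1,000 instructions but the kernel only requires 800 executed instructions, then

$$
\text{instructions replayed} = 100 \times \frac{1000 - 800}{1000} = 20\%
$$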

Global memory replay (%). Percentage of replayed instructions caused by global memory accesses. This is calculated as 100 * (l1 global load miss) / instructions issued
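For intuition, here is a minimal CUDA sketch (my own illustration, not from the linked answer; kernel name, sizes, and launch parameters are arbitrary) of a kernel whose strided global loads split each warp's request across many cache lines, so the extra L1 global load misses feed this metric:

```cuda
#include <cuda_runtime.h>

// Hypothetical illustration: with stride = 32 floats (128 bytes), the 32
// threads of a warp each touch a different 128-byte line, so one load
// instruction is replayed for every extra memory transaction it needs.
__global__ void strided_load(const float *in, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];   // stride > 1 => uncoalesced, miss-heavy loads
}

int main()
{
    const int n = 1 << 20, stride = 32;
    float *in, *out;
    cudaMalloc(&in,  (size_t)n * stride * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    strided_load<<<n / 256, 256>>>(in, out, stride);  // profile this launch in nvvp
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```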

Local memory replay (%). Percentage of replayed instructions caused by local memory accesses. This is calculated as 100 * (l1 local load miss + l1 local store miss) / instructions issued
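Local memory typically holds register spills and dynamically indexed per-thread arrays. A hedged sketch (again my own illustration, with arbitrary names and sizes) that usually forces a per-thread array into local memory, so its L1 misses show up under this metric:

```cuda
#include <cuda_runtime.h>

// Hypothetical illustration: a dynamically indexed per-thread array is
// normally placed in local memory (off-chip, cached in L1), so the loop's
// stores and the final load hit the l1 local load/store miss counters.
__global__ void local_array(float *out, const int *idx)
{
    float buf[256];                        // likely lowered to local memory
    for (int k = 0; k < 256; ++k)
        buf[k] = 0.5f * k;                 // local memory stores
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = buf[idx[t] & 255];            // dynamically indexed local load
}

int main()
{
    const int n = 1 << 16;
    float *out;
    int *idx;
    cudaMalloc(&out, n * sizeof(float));
    cudaMalloc(&idx, n * sizeof(int));
    cudaMemset(idx, 0, n * sizeof(int));
    local_array<<<n / 256, 256>>>(out, idx);  // profile this launch in nvvp
    cudaDeviceSynchronize();
    cudaFree(out);
    cudaFree(idx);
    return 0;
}
```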

Shared bank conflict replay (%). Percentage of replayed instructions caused by shared memory bank conflicts. This is calculated as 100 * (l1 shared bank conflict) / instructions issued
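The classic way to provoke this metric is a column-wise read of a shared memory tile. A hedged sketch (my own illustration): on hardware with 32 four-byte banks, every lane of a warp below reads an address 128 bytes apart, landing in the same bank; padding the tile to [32][33] is the usual fix:

```cuda
#include <cuda_runtime.h>

// Hypothetical illustration: reading tile[threadIdx.x][threadIdx.y] makes
// every lane of a warp (threadIdx.x = 0..31) access addresses 128 bytes
// apart, i.e. the same bank: a 32-way conflict replayed 31 extra times.
// Declaring the tile as tile[32][33] (padding) removes the conflict.
__global__ void column_read(float *out)
{
    __shared__ float tile[32][32];
    tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;  // conflict-free write
    __syncthreads();
    int t = blockIdx.x * blockDim.x * blockDim.y
          + threadIdx.y * blockDim.x + threadIdx.x;
    out[t] = tile[threadIdx.x][threadIdx.y];              // column read: conflicts
}

int main()
{
    const int blocks = 64;
    dim3 block(32, 32);
    float *out;
    cudaMalloc(&out, blocks * 32 * 32 * sizeof(float));
    column_read<<<blocks, block>>>(out);  // profile this launch in nvvp
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```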

