When using the CUDA profiler nvvp, there are several "overhead" metrics related to instructions, for example (a small kernel that triggers them is sketched after the list):
- Branch divergence overhead;
- Shared / global memory replay overhead; and
- Local / global cache replay overhead.
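
For concreteness, here is a minimal, hypothetical kernel (my own illustration, not from any profiler documentation) that should light up two of these counters: the odd/even branch diverges within each warp, and the strided shared-memory access produces a 32-way bank conflict, which the profiler charges as replay overhead. The kernel name `overhead_demo` and the launch configuration are made up for the example.

```cuda
#include <cuda_runtime.h>

// Hypothetical demo kernel: threads in a warp take different branches
// (divergence) and access shared memory with a 32-word stride
// (bank conflicts); both show up as "overhead" in nvvp.
__global__ void overhead_demo(float *out)
{
    __shared__ float buf[32 * 32];

    int tid  = threadIdx.x;
    int lane = tid % 32;   // lane within the warp
    int warp = tid / 32;   // warp index within the block

    // Branch divergence: within one warp, even and odd lanes follow
    // different paths, so the warp executes both paths serially
    // with the inactive lanes masked off.
    float v;
    if (lane % 2 == 0)
        v = tid * 2.0f;
    else
        v = tid * 0.5f;

    // Bank conflict: buf[lane * 32 + warp] puts all 32 lanes of a warp
    // into the same 4-byte bank (index % 32 == warp), so the access is
    // replayed up to 32 times ("shared memory replay overhead").
    buf[lane * 32 + warp] = v;
    __syncthreads();

    out[blockIdx.x * blockDim.x + tid] = buf[lane * 32 + warp];
}

int main()
{
    const int blocks = 2, threads = 128;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));
    overhead_demo<<<blocks, threads>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Profiling this under nvvp should show nonzero branch divergence and shared memory replay overhead, which is what the questions below are about.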
My questions:
- What causes these overheads?
- How are they calculated?
- Similarly, how are the global load / store efficiency metrics calculated?
Addendum: I found all the formulas for these overheads in the "CUDA Profiler User Guide" packaged with the CUDA 5 toolkit.
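
For anyone landing here later: from memory, the derived metrics in that guide are defined roughly as follows (compute capability 2.x event names; this is my recollection, not a quote, so verify the exact event names against the guide itself):

```latex
% Approximate metric definitions, recalled from the CUDA 5 profiler guide
% (compute capability 2.x hardware event names; treat as approximate):
\begin{align*}
\text{shared memory replay overhead} &=
    \frac{\mathit{l1\_shared\_bank\_conflict}}{\mathit{instructions\_issued}}\\[4pt]
\text{global cache replay overhead} &=
    \frac{\mathit{l1\_global\_load\_miss}}{\mathit{instructions\_issued}}\\[4pt]
\text{global load efficiency} &=
    100 \times \frac{\mathit{gld\_requested\_throughput}}{\mathit{gld\_throughput}}
\end{align*}
```

The common pattern is that a replay overhead divides the number of replayed (serialized) transactions by the total instructions issued, while an efficiency metric compares the throughput the kernel actually requested against what the hardware had to move.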