L2 Cache in NVIDIA Fermi

While looking through the names of the performance counters for the NVIDIA Fermi architecture (listed in the Compute_profiler.txt file in the CUDA doc folder), I noticed that there are two performance counters for L2 cache misses: l2_subp0_read_sector_misses and l2_subp1_read_sector_misses. The documentation says that these correspond to two slices of L2.

Why does the L2 cache have two slices? Is this related to the streaming multiprocessor architecture? What effect does this split have on performance?

Thanks.

2 answers

I do not think there is a direct connection with the streaming multiprocessor.

I think a slice is simply the equivalent of a memory bank.

Just add the values of the two counters together to get the "total" number of L2 read misses.
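As a minimal sketch of that summation, assuming the counter values have already been read from the profiler output into a dictionary (the helper name and the sample numbers below are made up for illustration; only the two counter names come from the question):

```python
def total_l2_read_misses(counters):
    """Sum every per-slice L2 read-miss counter into a single total."""
    return sum(
        value
        for name, value in counters.items()
        if name.startswith("l2_subp") and name.endswith("_read_sector_misses")
    )


# Hypothetical sample values, not real measurements.
sample = {
    "l2_subp0_read_sector_misses": 1200,
    "l2_subp1_read_sector_misses": 1350,
}

print(total_l2_read_misses(sample))  # prints 2550
```

Matching on the name prefix and suffix rather than hard-coding two keys means the same helper keeps working if a device reports more than two slices.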


The CUDA C Programming Guide describes the architecture of a multiprocessor. It states that each Fermi multiprocessor has two warp schedulers. I assume that the L2 cache is split in two so that it can be accessed concurrently.

I have not looked at the L2 read-miss counters for the Kepler architecture, but Kepler multiprocessors have four warp schedulers. So this assumption could be checked by seeing whether there are four such performance counters for Kepler.
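One way to run that check would be to count the distinct slice indices among the counter names a given device exposes. The sketch below assumes the names follow the same l2_subpN_read_sector_misses pattern as the two Fermi counters in the question; the function name is hypothetical, and the list of names would have to come from the profiler documentation for the device being checked.

```python
import re

# Assumed naming pattern, generalized from the two Fermi counter names.
SLICE_PATTERN = re.compile(r"^l2_subp(\d+)_read_sector_misses$")


def count_l2_slices(counter_names):
    """Return how many distinct L2 slice indices appear in the counter names."""
    indices = set()
    for name in counter_names:
        match = SLICE_PATTERN.match(name)
        if match:
            indices.add(int(match.group(1)))
    return len(indices)


# The two counter names quoted in the question (Fermi).
fermi_names = ["l2_subp0_read_sector_misses", "l2_subp1_read_sector_misses"]

print(count_l2_slices(fermi_names))  # prints 2
```

A result of four for a Kepler counter list would be consistent with the assumption, although it would not by itself show that the slice count is tied to the number of warp schedulers.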
