I wrote some simple tests that perform a series of global memory accesses. When I measured the L1 and L2 cache statistics, I found that (on a GTX580 with 16 SMs):
total L1 cache misses * 16 != total L2 cache queries
In fact, the right side is much higher than the left (about five times as large). I have heard that register spills can also end up in L2, but my kernel uses fewer than 28 registers, which is not many. What could be the source of this difference? Or am I misinterpreting what these performance counters mean?
Thanks.
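From the CUDA Programming Guide, Section G.4.2:
Global memory accesses are cached. Using the -dlcm compilation flag, they can be configured at compile time to be cached in both L1 and L2 (-Xptxas -dlcm=ca) (this is the default setting) or in L2 only (-Xptxas -dlcm=cg). A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions, whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.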
So an L1 transaction is 128 bytes, while an L2 transaction is 32 bytes.
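To make the size mismatch concrete, here is a minimal sketch of the arithmetic in host-side C++. The helper names are hypothetical, and the model rests on two assumptions that the question does not confirm: that the L1 miss counter is reported per SM, and that each L1 miss is counted by the profiler as one L2 query per 32-byte piece of the 128-byte line.

```cpp
#include <cstdint>

// Sizes taken from the Programming Guide passage quoted above.
constexpr std::uint64_t kL1LineBytes = 128;        // transaction size when cached in L1 and L2
constexpr std::uint64_t kL2TransactionBytes = 32;  // transaction size at the L2 level

// Hypothetical helper: how many 32-byte L2 queries cover one 128-byte L1 line.
constexpr std::uint64_t l2_queries_per_l1_miss() {
    return kL1LineBytes / kL2TransactionBytes;
}

// Hypothetical helper: rough estimate of the L2 queries caused by L1 misses.
// Assumes the L1 miss count is per SM and every miss fetches a full line.
constexpr std::uint64_t estimated_l2_queries(std::uint64_t l1_misses_per_sm,
                                             std::uint64_t num_sms) {
    return l1_misses_per_sm * num_sms * l2_queries_per_l1_miss();
}

// Checked at compile time: 128 / 32 == 4.
static_assert(l2_queries_per_l1_miss() == 4, "one L1 line spans four L2 transactions");
```

Under these assumptions, `estimated_l2_queries(misses, 16)` is four times the left side of the comparison in the question. That would account for a factor of four, not the full factor of about five that was measured, so some other L2 traffic would still have to explain the remainder.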