The CUDA C Programming Guide describes the architecture of a multiprocessor. It states that each Fermi multiprocessor has two warp schedulers. I assume that the L2 cache is split accordingly, one part per warp scheduler, to provide concurrent caching.
I have not looked at the L2 read misses for the Kepler architecture, but Kepler multiprocessors have four warp schedulers. So this assumption could be checked: it would be supported if four such performance counters show up when compiling for Kepler.
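To make that check concrete, here is a minimal sketch in plain C++ (host code only, no GPU needed). The function names `warpSchedulersPerSM` and `countersMatchSchedulers` are hypothetical helpers, not part of any CUDA API. The sketch assumes the scheduler counts given above (two for Fermi, compute capability 2.x; four for Kepler, 3.x), and that the number of counters has already been read off the profiler output by hand.

```cpp
// Warp schedulers per multiprocessor, keyed by the major compute capability.
// Only the two architectures discussed here are covered.
// Returns -1 for any architecture this sketch does not know about.
int warpSchedulersPerSM(int ccMajor) {
    switch (ccMajor) {
        case 2:  return 2;   // Fermi
        case 3:  return 4;   // Kepler
        default: return -1;  // not covered by this sketch
    }
}

// Hypothetical check of the assumption: does the number of L2 read-miss
// counters seen in the profiler equal the number of warp schedulers?
// observedCounters is supplied by the user; it is not queried from the device.
bool countersMatchSchedulers(int ccMajor, int observedCounters) {
    int expected = warpSchedulersPerSM(ccMajor);
    return expected > 0 && expected == observedCounters;
}
```

On a real device, the major compute capability could be taken from the `major` field of `cudaDeviceProp` after calling `cudaGetDeviceProperties`. A match would only be consistent with the assumption; it would not prove how the L2 cache is actually organized.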