CUDA Compute Capability 2.0. Global memory access scheme

From CUDA Compute Capability 2.0 (Fermi), access to global memory goes through a 768 KB L2 cache. It seems the developer no longer has to care about global memory banks. But global memory is still very slow, so the right access pattern is important. Now the point is to use/reuse L2 as much as possible. And my question is: how? I would be grateful for detailed information on how L2 works and how I should organize and access global memory if I need, for example, 100-200 array elements per thread.

1 answer

The L2 cache helps in some cases, but it does not eliminate the need for coalesced access to global memory. In a nutshell, coalesced access means that for a given read (or write) instruction, individual threads within a warp read (or write) adjacent, contiguous locations in global memory, preferably aligned as a group on a 128-byte boundary. This results in the most efficient use of the available memory bandwidth.

In practice this is often not difficult to accomplish. For example:

    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    int mylocal = global_array[idx];

will give coalesced (read) access across all threads in the warp, assuming that global_array is allocated in the usual way with cudaMalloc in global memory. This type of access makes 100% use of the available memory bandwidth.
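
For contrast, here is a hypothetical counter-example (not from the original answer; the stride value is made up). If each thread reads with a stride of 32 ints, neighbouring threads touch addresses 128 bytes apart, so every thread in a warp lands in a different 128-byte segment for a single read instruction and most of each transaction's bandwidth is wasted:

    // Hypothetical strided pattern: neighbouring threads read addresses
    // 32 * sizeof(int) = 128 bytes apart, so each thread in the warp lands
    // in a different 128-byte segment and the reads do not coalesce.
    #define STRIDE 32
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    int mylocal = global_array[idx * STRIDE];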

A key takeaway is that memory transactions generally occur in 128-byte blocks, which happens to be the size of a cache line. If you request even one of the bytes in a block, the entire block will be read (and normally stored in L2). If you later read other data from that block, it will normally be serviced from L2, unless it has been evicted by other memory activity. This means that a sequence like the following:

    int mylocal1 = global_array[0];
    int mylocal2 = global_array[1];
    int mylocal3 = global_array[31];

would all typically be serviced from a single 128-byte block. The first read, for mylocal1, causes the whole 128-byte block to be read. The second read, for mylocal2, would normally be serviced from the cached value (in L2 or L1) rather than by triggering another read from memory. However, if the algorithm can be modified appropriately, it is better to read all your data contiguously from multiple threads, as in the first example. This may just be a matter of clever data organization, for example using structures of arrays rather than arrays of structures.
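
To illustrate the structures-of-arrays point, here is a minimal sketch (the PointAoS type and the kernel names are hypothetical, not from the original answer):

    // Array of Structures (AoS): each thread reads p[idx].x, so the values
    // touched by a warp are strided by sizeof(PointAoS) and coalesce poorly.
    struct PointAoS { float x, y, z; };

    __global__ void read_aos(const PointAoS *p, float *out, int n)
    {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        if (idx < n) out[idx] = p[idx].x;      // 12-byte stride between threads
    }

    // Structure of Arrays (SoA): all x values live in one contiguous array,
    // so adjacent threads read adjacent words and the access coalesces.
    __global__ void read_soa(const float *x, float *out, int n)
    {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        if (idx < n) out[idx] = x[idx];        // contiguous, coalesced
    }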

In many respects this is similar to CPU cache behavior. The concept of a cache line is similar, as is the behavior of servicing requests out of the cache.

Fermi L1 and L2 can support write-back and write-through. L1 is available on a per-SM basis and is configurable together with shared memory as either 16 KB L1 (and 48 KB shared memory) or 48 KB L1 (and 16 KB shared memory). L2 is unified across the device and is 768 KB.
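
If the split matters for a particular kernel, it can be requested from the host with cudaFuncSetCacheConfig (the kernel name below is hypothetical); the runtime treats it as a preference, not a guarantee:

    // Prefer 48 KB L1 / 16 KB shared memory for this kernel.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

    // Or prefer 48 KB shared memory / 16 KB L1 instead.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);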

A few suggestions I would offer: do not assume that the L2 cache just fixes sloppy memory accesses. The GPU caches are much smaller than equivalent CPU caches, so it is easier to get into trouble there. A general piece of advice is simply to code as if the caches were not there. Rather than CPU-oriented strategies such as cache blocking, it is usually better to focus your coding effort on generating coalesced accesses and then possibly make use of shared memory in some specific cases. Then, for the inevitable cases where we cannot make perfect memory accesses in all situations, we let the caches provide their benefit.
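
As a rough sketch of the "shared memory in specific cases" point (the kernel name, tile size and reuse pattern are made up for illustration; it assumes blockDim.x == TILE and n is a multiple of TILE): stage a tile with coalesced loads, synchronize, then reuse the staged data from shared memory:

    #define TILE 256

    __global__ void tile_reuse(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE];
        int idx = threadIdx.x + blockDim.x * blockIdx.x;

        tile[threadIdx.x] = in[idx];            // one coalesced global read per thread
        __syncthreads();

        // Each thread can now reuse any element of the tile without touching
        // global memory again; here it averages its element with a neighbour.
        int next = (threadIdx.x + 1) % TILE;
        out[idx] = 0.5f * (tile[threadIdx.x] + tile[next]);
    }

    // launch: tile_reuse<<<n / TILE, TILE>>>(d_in, d_out, n);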

You can get more in-depth guidance by looking at some of the available NVIDIA webinars. For example, the webinar on global memory usage and strategy (and its slides), or the webinar on CUDA Shared Memory and Cache, would be instructive for this topic. You may also want to read the device memory access section of the CUDA C Programming Guide.
