The relatively low performance you are seeing with CUDA shared memory and large arrays is likely related to the fact that each multiprocessor has only a limited amount of shared memory available.
Each multiprocessor contains a number of processors; for modern devices, typically 32, the number of threads in a warp. This means that, in the absence of divergence or memory stalls, the average processing rate is 32 instructions per cycle (latency is high due to pipelining).
CUDA schedules several blocks onto a multiprocessor. Each block consists of several warps. When a warp stalls on a global memory access (even coalesced accesses have high latency), other warps are processed. This effectively hides the latency, which is why high-latency global memory is acceptable on GPUs. To hide latency effectively, you need enough additional warps to execute until the stalled warp can continue. If all warps are stalled on memory accesses, you can no longer hide the latency.
Shared memory is allocated to blocks in CUDA and resides on a single multiprocessor on the GPU. Each multiprocessor has a relatively small, fixed amount of shared memory. CUDA cannot schedule more blocks onto a multiprocessor than the multiprocessor can support in terms of shared memory and register usage. In other words, if the amount of shared memory on a multiprocessor is X and each block requires Y of shared memory, CUDA will schedule no more than floor(X / Y) blocks on each multiprocessor at a time (it may be fewer, since there are other constraints, such as register usage).
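If you want to sanity-check this programmatically rather than by hand, the runtime API exposes cudaOccupancyMaxActiveBlocksPerMultiprocessor. Here is a minimal sketch; the kernel, its 16 KB of shared memory, and the block size of 256 are illustrative assumptions, not anything from your code:

```
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: statically allocates 16 KB of shared memory per block.
__global__ void bigSharedKernel(float *out)
{
    __shared__ float tile[4096];              // 4096 floats = 16 KB per block
    int tid = threadIdx.x;
    tile[tid] = (float)tid;
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = tile[tid];
}

int main()
{
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of 256 threads can be resident on one
    // multiprocessor, given this kernel's shared memory and register usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  bigSharedKernel,
                                                  256,  // threads per block
                                                  0);   // dynamic shared mem (bytes)
    printf("Resident blocks per multiprocessor: %d\n", blocksPerSM);
    return 0;
}
```

On a device with, say, 48 KB of shared memory per multiprocessor you would expect at most floor(48 / 16) = 3 blocks from the shared memory limit alone; the API call also accounts for registers and the hardware limits on resident blocks and warps.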
Ergo, by increasing the shared memory usage per block, you can reduce the number of active warps, i.e. the occupancy, of your kernel, which can degrade performance. You should look into your kernel code by compiling with the flag -Xptxas="-v"; this should report the register, shared, and constant memory usage for each kernel. Use this data and your kernel launch parameters, as well as other relevant information, in the most recent version of the CUDA Occupancy Calculator to determine whether occupancy might be the issue.
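For example, compiling with verbose ptxas output looks roughly like this (the file name, kernel name, and all numbers below are purely illustrative; the exact figures depend on your code and target architecture):

```
$ nvcc -arch=sm_70 -Xptxas="-v" -c mykernel.cu
ptxas info    : Compiling entry function '_Z8myKernelPf' for 'sm_70'
ptxas info    : Used 10 registers, 16384 bytes smem, 360 bytes cmem[0]
```

Plug the registers-per-thread and shared-memory-per-block numbers, together with your block size, into the Occupancy Calculator to see how many warps can actually be resident per multiprocessor.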
EDIT:
To address the other part of your question, shared memory access with no bank conflicts versus perfectly coalesced global memory access... there are two dimensions to the answer: latency and bandwidth. Shared memory latency will be lower than global memory latency, since shared memory is on-chip. The bandwidth will be roughly the same. Ergo, if you can hide the global memory access latency through coalescing, there is no penalty (note: the access pattern is important here; shared memory allows much more diverse access patterns with little performance loss, so there can be benefits to using shared memory even if you can hide all of the global memory latency).
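To make that last note concrete, the classic tiled transpose (a standard technique, not something from your code; the tile size, names, and launch configuration are illustrative choices) reads and writes global memory in a coalesced pattern and does the awkward strided access in shared memory instead, which is exactly the kind of case where shared memory wins even though its bandwidth is not higher:

```
#include <cuda_runtime.h>

#define TILE 32

// Assumes a launch with dim3 block(TILE, TILE) and a grid covering the
// width x height input matrix. Both the global read and the global write are
// coalesced; the transposed (strided) access happens in shared memory, where
// it is cheap. The +1 padding avoids shared memory bank conflicts.
__global__ void transposeTiled(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                       // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

Without the shared memory tile, either the read or the write would have to be strided in global memory, and that strided pattern is what costs bandwidth; shared memory absorbs it at much lower cost.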