The relatively low performance you are seeing with CUDA shared memory and large arrays is likely related to the fact that each multiprocessor has only a limited amount of shared memory available.
Each multiprocessor contains a number of processors; for modern devices, typically 32, the number of threads in a warp. This means that, in the absence of divergence or memory stalls, the average processing rate is 32 instructions per cycle (latency is high due to pipelining).
CUDA schedules several blocks onto a multiprocessor. Each block consists of several warps. When a warp stalls on a global memory access (even coalesced accesses have high latency), other warps are processed. This effectively hides the latency, which is why high-latency global memory is acceptable on GPUs. To hide latency effectively, you need enough additional warps to execute until the stalled warp can continue. If all warps are stalled on memory accesses, you can no longer hide the latency.
Shared memory is allocated to blocks in CUDA and resides on a single multiprocessor on the GPU. Each multiprocessor has a relatively small, fixed amount of shared memory. CUDA cannot schedule more blocks onto a multiprocessor than the multiprocessor can support in terms of shared memory and register usage. In other words, if the amount of shared memory on a multiprocessor is X and each block requires Y of shared memory, CUDA will schedule no more than floor(X / Y) blocks on each multiprocessor at a time (it may be fewer, since there are other constraints, such as register usage).
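If you want to sanity-check this programmatically rather than by hand, the runtime API exposes cudaOccupancyMaxActiveBlocksPerMultiprocessor. Here is a minimal sketch; the kernel, its 16 KB of shared memory, and the block size of 256 are illustrative assumptions, not anything from your code:

```
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: statically allocates 16 KB of shared memory per block.
__global__ void bigSharedKernel(float *out)
{
    __shared__ float tile[4096];              // 4096 floats = 16 KB per block
    int tid = threadIdx.x;
    tile[tid] = (float)tid;
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = tile[tid];
}

int main()
{
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of 256 threads can be resident on one
    // multiprocessor, given this kernel's shared memory and register usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  bigSharedKernel,
                                                  256,  // threads per block
                                                  0);   // dynamic shared mem (bytes)
    printf("Resident blocks per multiprocessor: %d\n", blocksPerSM);
    return 0;
}
```

On a device with, say, 48 KB of shared memory per multiprocessor you would expect at most floor(48 / 16) = 3 blocks from the shared memory limit alone; the API call also accounts for registers and the hardware limits on resident blocks and warps.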
Ergo, by increasing the shared memory usage per block, you can reduce the number of active warps, i.e. the occupancy, of your kernel, which can degrade performance. You should look into your kernel code by compiling with the flag -Xptxas="-v"; this should report the register, shared, and constant memory usage for each kernel. Use this data and your kernel launch parameters, as well as other relevant information, in the most recent version of the CUDA Occupancy Calculator to determine whether occupancy might be the issue.
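For example, compiling with verbose ptxas output looks roughly like this (the file name, kernel name, and all numbers below are purely illustrative; the exact figures depend on your code and target architecture):

```
$ nvcc -arch=sm_70 -Xptxas="-v" -c mykernel.cu
ptxas info    : Compiling entry function '_Z8myKernelPf' for 'sm_70'
ptxas info    : Used 10 registers, 16384 bytes smem, 360 bytes cmem[0]
```

Plug the registers-per-thread and shared-memory-per-block numbers, together with your block size, into the Occupancy Calculator to see how many warps can actually be resident per multiprocessor.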
EDIT:
To address the other part of your question, shared memory access with no bank conflicts versus perfectly coalesced global memory access... there are two dimensions to the answer: latency and bandwidth. Shared memory latency will be lower than global memory latency, since shared memory is on-chip. The bandwidth will be roughly the same. Ergo, if you can hide the global memory access latency through coalescing, there is no penalty (note: the access pattern is important here; shared memory allows much more diverse access patterns with little performance loss, so there can be benefits to using shared memory even if you can hide all of the global memory latency).
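To make that last note concrete, the classic tiled transpose (a standard technique, not something from your code; the tile size, names, and launch configuration are illustrative choices) reads and writes global memory in a coalesced pattern and does the awkward strided access in shared memory instead, which is exactly the kind of case where shared memory wins even though its bandwidth is not higher:

```
#include <cuda_runtime.h>

#define TILE 32

// Assumes a launch with dim3 block(TILE, TILE) and a grid covering the
// width x height input matrix. Both the global read and the global write are
// coalesced; the transposed (strided) access happens in shared memory, where
// it is cheap. The +1 padding avoids shared memory bank conflicts.
__global__ void transposeTiled(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                       // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

Without the shared memory tile, either the read or the write would have to be strided in global memory, and that strided pattern is what costs bandwidth; shared memory absorbs it at much lower cost.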