Typically, good coalesced access can be achieved when neighboring threads access neighboring cells in memory. So, if tid holds the index of your thread, then accessing:
arr[tid] --- gives perfect coalescing
arr[tid+5] --- is almost perfect, probably misaligned
arr[tid*4] --- is not as good anymore, because of the gaps
arr[random(0..N)] --- horrible!
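As a minimal sketch of the first and third patterns (the kernel and array names here are mine, for illustration only): each thread computes its global index and touches one element, so neighboring threads read neighboring addresses and the accesses coalesce; the commented-out line shows the strided variant.

__global__ void copyCoalesced(const float* arr, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (tid < n)
        out[tid] = arr[tid];      // neighboring threads -> neighboring cells: coalesced
    // out[tid] = arr[tid * 4];   // strided access leaves gaps and wastes bandwidth
}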
I am speaking from the perspective of a CUDA programmer, but similar rules apply elsewhere, even in plain CPU programming, although the impact there is not as big.
"But I have so many arrays, everyone has about 2 or 3 times more than the number of my threads, and using a template like" arr [tid * 4] "is inevitable. What can be cured of this?"
If the offset is a multiple of some higher power of 2 (for example, 16*x or 32*x), it is not a problem. So, if you have to process a rather long array in a for loop, you can do something like this:
for (size_t base = 0; base < arraySize; base += numberOfThreads)
    process(arr[base + threadIndex]);
(the above assumes that the array size is a multiple of the number of threads)
So, if the number of threads is a multiple of 32, memory access will be good.
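Putting that loop into a kernel, a sketch might look like this (names are illustrative; as above, it assumes arraySize is a multiple of the total thread count):

__global__ void processLongArray(float* arr, size_t arraySize)
{
    size_t threadIndex     = blockIdx.x * blockDim.x + threadIdx.x;
    size_t numberOfThreads = (size_t)gridDim.x * blockDim.x;

    // Each iteration starts at an offset that is a multiple of the thread
    // count (itself a multiple of 32), so every access stays coalesced.
    for (size_t base = 0; base < arraySize; base += numberOfThreads)
        arr[base + threadIndex] *= 2.0f;   // stand-in for process(...)
}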
To repeat: I am speaking from the perspective of a CUDA programmer. For different GPUs/environments you may need fewer or more threads for perfectly coalesced memory access, but similar rules should apply.
Is "32" related to warp size available in parallel to global memory?
Although not directly, there is some connection. Global memory is divided into segments of 32, 64 and 128 bytes, which are accessed by half-warps. The more segments a given memory fetch instruction touches, the longer it takes. You can read more details in the CUDA Programming Guide; there is a whole chapter on this topic: "5.3 Maximize Memory Throughput".
In addition, I've heard a bit about using shared memory to localize memory accesses. Is it preferable for coalescing, or does it have its own difficulties?

Shared memory is much faster because it sits on-chip, but its size is limited. It is not segmented like global memory; you can access it almost at random with no penalty. However, it is organized into memory banks, each 4 bytes wide (the size of a 32-bit int). The memory addresses accessed by the threads should be different modulo 16 (or 32, depending on the GPU). As a result, the address [tid*4] will be much slower than [tid*5], because the first one accesses only banks 0, 4, 8 and 12, while the latter accesses 0, 5, 10, 15, 4, 9, 14, ... (bank id = address modulo 16).
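A rough sketch of that last point (an illustrative kernel of my own, assuming a 16-bank GPU and a single block of 32 threads):

__global__ void bankStrideDemo(float* out)
{
    __shared__ float buf[32 * 5];
    int tid = threadIdx.x;

    buf[tid * 5] = (float)tid;     // stride 5: (tid*5) mod 16 spreads over all banks
    // buf[tid * 4] = (float)tid;  // stride 4: only banks 0, 4, 8, 12 -> 4-way conflict,
    //                             // so the accesses are serialized
    __syncthreads();

    out[tid] = buf[tid * 5];
}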
Again, you can read more in the CUDA Programming Guide.