You are not using constant memory the way it is meant to be used.
- A single read from constant memory can be broadcast to a half-warp (not your case, since each thread loads from its own tid).
- Constant memory is cached (not exploited in your case, since you read each position of the constant memory array only once).
Since each thread in a half-warp does a single read of different data, the 16 different reads get serialized, so it takes 16 times as long to issue the request.
If the threads read from global memory instead, the requests are issued at the same time, coalesced. This is why your global memory example performs better than the constant memory one.
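To make the three access patterns concrete, here is a minimal sketch. The kernel names, the array size `N`, and the helper `runDemo` are hypothetical and not taken from your code; the sketch only checks that each kernel returns the right values and does not measure timing. Error checking of the CUDA calls is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4096  // hypothetical size: 16 KB of floats, within the 64 KB constant space

__constant__ float c_data[N];  // constant-memory copy of the input

// Pattern from the question: every thread reads its own element.
// The 16 threads of a half-warp touch 16 different constant addresses,
// so on compute capability 1.x the reads are serialized.
__global__ void readConstPerThread(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) out[tid] = c_data[tid];
}

// Same indexing from global memory: consecutive threads read
// consecutive addresses, so the reads can be coalesced.
__global__ void readGlobalPerThread(const float *in, float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) out[tid] = in[tid];
}

// Pattern constant memory is designed for: all threads read the same
// address, so a single read is broadcast to the half-warp.
__global__ void readConstBroadcast(float *out, int k) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) out[tid] = c_data[k];
}

// Runs the three kernels and returns the number of wrong output values.
int runDemo() {
    static float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    const size_t bytes = N * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_data, h_in, bytes);  // fill the constant array

    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    const int k = 7;  // arbitrary index used for the broadcast read
    int mismatches = 0;

    readConstPerThread<<<blocks, threads>>>(d_out);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) if (h_out[i] != h_in[i]) ++mismatches;

    readGlobalPerThread<<<blocks, threads>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) if (h_out[i] != h_in[i]) ++mismatches;

    readConstBroadcast<<<blocks, threads>>>(d_out, k);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) if (h_out[i] != h_in[k]) ++mismatches;

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches;
}

int main() {
    printf("mismatches: %d\n", runDemo());
    return 0;
}
```

To compare the patterns on your own device, you could wrap each launch in `cudaEvent` timing. If the data is indexed per thread, as in the first kernel, global memory is probably the better fit; constant memory pays off when all threads of a half-warp read the same value, as in the last kernel.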
Of course, this conclusion may differ on devices of compute capability 2.x, which have L1 and L2 caches.
Regards!