How many cycles of memory latency does each type of memory access take in OpenCL / CUDA?

I looked through the programming guide and the best practices guide, and they mention that a global memory access takes 400-600 cycles. I did not find much about the other memory types, such as the texture cache, the constant cache, and shared memory. Registers have 0 memory latency.

I think the constant cache is as fast as registers if all threads use the same address in the constant cache. I am not sure about the worst case.

Is shared memory as fast as registers if there are no bank conflicts? And if there are conflicts, how does the latency change?

What about the texture cache?


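Latency to the shared/constant/texture memories is small and depends on which device you have. In general, though, GPUs are designed as throughput architectures, which means that by creating enough threads, the latency to memory, including global memory, is hidden.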

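The guides talk about global memory latency because it is much higher than that of the other memories, so it is the dominant latency to consider when optimizing.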
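As a rough illustration of how running many threads hides memory latency, here is a minimal Python sketch of an idealized model. The function name `warps_to_hide_latency` and the figure of 20 cycles of independent work per warp are assumptions chosen for illustration, not measured values; the 400 and 600 cycle latencies are the ones quoted in the question.

```python
import math


def warps_to_hide_latency(latency_cycles: int, work_cycles_per_warp: int) -> int:
    """Idealized count of resident warps needed to keep a multiprocessor busy.

    Model (an assumption, not a hardware spec): one warp stalls for
    `latency_cycles` on a memory request, and every other warp can supply
    `work_cycles_per_warp` cycles of independent work in the meantime.
    """
    if latency_cycles < 0 or work_cycles_per_warp <= 0:
        raise ValueError("latency must be >= 0 and work per warp must be > 0")
    # Warps needed to fill the stall, plus the stalled warp itself.
    return math.ceil(latency_cycles / work_cycles_per_warp) + 1


if __name__ == "__main__":
    for latency in (400, 600):
        print(latency, warps_to_hide_latency(latency, 20))
```

Under these assumptions the model asks for 21 warps at 400 cycles and 31 warps at 600 cycles. Real schedulers, instruction-level parallelism, and memory bandwidth limits all change the answer, so treat this as a back-of-the-envelope estimate only.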

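You mentioned the constant cache in particular. You are right that if all threads in a warp (i.e. a group of 32 threads) access the same address, there is no penalty: the value is read from the cache and broadcast to all threads at once. If the threads access different addresses, however, the accesses are serialized, because the cache can deliver only one value at a time. If you use the CUDA Profiler, this shows up as serialization.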
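The question's guess about the constant cache (fast when every thread in a warp reads the same address) can be expressed as a simplified counting model. The sketch below is Python, not device code; `constant_read_passes` is a hypothetical helper, and the rule "one pass per distinct address" is a simplification assumed for illustration rather than a documented hardware guarantee.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def constant_read_passes(addresses):
    """Number of serialized passes for one warp-wide constant memory read.

    Simplified model (an assumption): each distinct address requested by the
    warp costs one pass, and threads sharing an address are served together
    by a broadcast.
    """
    if len(addresses) != WARP_SIZE:
        raise ValueError("expected one address per thread in the warp")
    return len(set(addresses))


if __name__ == "__main__":
    uniform = [0x40] * WARP_SIZE                     # every thread reads the same value
    divergent = [4 * t for t in range(WARP_SIZE)]    # every thread reads a different value
    print(constant_read_passes(uniform), constant_read_passes(divergent))
```

In this model the uniform pattern takes 1 pass and the fully divergent pattern takes 32, which is why per-thread indexing into constant memory tends to be a poor fit.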

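Shared memory, unlike the constant cache, can deliver much higher bandwidth. See the CUDA Optimization talk for more details, including an explanation of bank conflicts and their impact.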
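To make the bank conflicts raised in the question concrete, here is a small Python sketch that computes the conflict degree of one warp's shared memory access. It assumes 32 banks that are each 4 bytes wide, which matches the default configuration on many NVIDIA devices but is not universal; `bank_conflict_degree` and `strided` are hypothetical helper names.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumption: 32 shared memory banks
BANK_WIDTH = 4   # assumption: each bank is 4 bytes wide


def bank_conflict_degree(byte_addresses):
    """Worst-case number of distinct words any single bank must serve.

    A degree of 1 means no conflict. Threads reading the same word share one
    broadcast, so they do not add to the degree. Expects a non-empty list.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


def strided(stride_words, threads=32):
    """Byte addresses for thread t reading word t * stride_words."""
    return [t * stride_words * BANK_WIDTH for t in range(threads)]


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        print(stride, bank_conflict_degree(strided(stride)))
```

With these assumptions, strides of 1 and 33 words are conflict-free (degree 1), a stride of 2 gives a 2-way conflict, and a stride of 32 gives a 32-way conflict. The stride 33 case is the usual reason for padding a 32-wide shared array by one element.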


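For a (Kepler) Tesla K20, the latencies are as follows:

Global memory: 440 cycles

Constant memory:

    L1: 48 cycles
    L2: 120 cycles

Shared memory: 48 cycles

Texture memory:

    L1: 108 cycles
    L2: 240 cycles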
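The figures above are in clock cycles, so converting them to wall-clock time requires the device's core clock. A minimal Python sketch of that arithmetic follows; the 706 MHz value is an assumption (the commonly quoted K20 core clock), and `cycles_to_ns` is a hypothetical helper, so substitute the clock rate your own device reports.

```python
K20_CORE_CLOCK_HZ = 706e6  # assumption: commonly quoted Tesla K20 core clock


def cycles_to_ns(cycles: float, clock_hz: float = K20_CORE_CLOCK_HZ) -> float:
    """Convert a latency in clock cycles to nanoseconds at the given clock."""
    if clock_hz <= 0:
        raise ValueError("clock rate must be positive")
    return cycles / clock_hz * 1e9


if __name__ == "__main__":
    for cycles in (440, 240, 120, 108, 48):
        print(f"{cycles} cycles = {cycles_to_ns(cycles):.1f} ns")
```

At the assumed clock, the largest figure above (440 cycles) works out to roughly 623 ns.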

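How do I know? I ran the microbenchmarks from the GPU microbenchmarking work, which reports similar measurements for the older GTX 280.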

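This was measured on a Linux cluster; the compute node running the benchmarks was not being used by anyone else. It runs BULLX Linux with 8-core Xeons and 64 GB of RAM, with nvcc 6.5.12. I changed sm_20 to sm_35 when compiling.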

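The PTX ISA documentation also has a section on operand costs, although it is not very helpful: it gives only rough estimates rather than precise figures.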
