How many memory latency cycles for each type of memory access in OpenCL / CUDA?

Question

I looked through the programming guide and the best practices guide, and they mention that a global memory access takes 400-600 cycles. I did not find much about the other memory types, such as the texture cache, constant cache, and shared memory. Registers should have 0 memory latency.

I think the constant cache is as fast as registers if all threads read the same address in the constant cache. I'm not sure about the worst case.

Is shared memory as fast as registers if there are no bank conflicts? If there are conflicts, how does the latency change?

What about the texture cache?

memory latency opencl cuda nvidia

smuggledpancakes Nov 04 '10 at 14:27

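For the (Kepler) Tesla K20, the latencies in clock cycles are:

Global memory: 440

Constant memory
L1: 48
L2: 120

Shared memory: 48

Texture memory
L1: 108
L2: 240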

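How do I know? I ran the microbenchmarks from the paper "Demystifying GPU Microarchitecture through Microbenchmarking", which reports measurements for the older GTX 280.

This was measured on a Linux node: BULLX Linux, 8-core Xeons, 64 GB of RAM, nvcc 6.5.12. I changed sm_20 to sm_35 when compiling.

The PTX ISA documentation also has a section on operand costs, but it gives only rough estimates rather than precise figures.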
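To put cycle counts like these into wall-clock terms, they can be divided by the core clock. The snippet below is only a rough sketch: the 706 MHz figure is an assumption (the K20's nominal core clock), and the helper name `cycles_to_ns` is made up for illustration.

```python
# Convert latencies given in clock cycles into nanoseconds.
# ASSUMPTION: a 706 MHz core clock (nominal for the Tesla K20);
# substitute the clock of the card you actually measure on.
CORE_CLOCK_MHZ = 706.0


def cycles_to_ns(cycles, clock_mhz=CORE_CLOCK_MHZ):
    """Return the duration of `cycles` clock cycles in nanoseconds."""
    # One cycle lasts 1000 / clock_mhz nanoseconds.
    return cycles * 1000.0 / clock_mhz


# The cycle counts quoted in this answer, from fastest to slowest.
for cycles in (48, 108, 120, 240, 440):
    print(f"{cycles:4d} cycles ~ {cycles_to_ns(cycles):6.1f} ns")
```

Under that assumed clock, a 440-cycle access costs a little over 600 ns, which is why it dominates the other figures.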

the swine 09 . '15 16:29

Tom · Accepted Answer · 2010-11-04T16:40:37+0000

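The latency of the shared/constant/texture memories is small, and it depends on the device you have. In general, though, GPUs are designed for throughput: with enough threads in flight, memory latency, including global memory latency, is hidden.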
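As a back-of-the-envelope illustration of hiding latency with more threads, the sketch below estimates how many warps must be resident so that some warp always has work while the others wait on memory. It is a deliberately simplified model, not how the hardware scheduler works: `busy_cycles_per_warp` (cycles of independent work a warp issues between memory accesses) is a hypothetical parameter, and instruction-level parallelism is ignored.

```python
import math


def warps_to_hide_latency(latency_cycles, busy_cycles_per_warp):
    """Estimate the warps needed to cover one memory access latency.

    Simplified model (an assumption, not a hardware spec): each warp
    does `busy_cycles_per_warp` cycles of work, then stalls for
    `latency_cycles`. While one warp is stalled, the other warps must
    supply at least `latency_cycles` cycles of work between them.
    """
    if busy_cycles_per_warp <= 0:
        raise ValueError("busy_cycles_per_warp must be positive")
    other_warps = math.ceil(latency_cycles / busy_cycles_per_warp)
    return other_warps + 1  # plus the stalled warp itself


# A 440-cycle access with 20 cycles of work per warp between accesses.
print(warps_to_hide_latency(440, 20))
```

The point of the model is the trend: the longer the latency, or the less independent work each warp has, the more resident warps are needed.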

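You mentioned the constant cache in particular. You are right that if all threads in a warp (i.e. 32 threads) read the same address, there is no penalty: the value is read from the cache and broadcast to all of them at once. If the threads read different addresses, however, the accesses are serialized, because the cache delivers only one value at a time. If you use the CUDA Profiler, this shows up as serialization.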
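The constant cache behaviour within a warp can be modelled in a few lines. This is a sketch under one stated assumption, that the cache serves one distinct address per pass, so the cost of a warp-wide read grows with the number of distinct addresses. The function name `constant_cache_passes` is invented for illustration.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def constant_cache_passes(addresses):
    """Count the serialized constant-cache reads one warp would need.

    ASSUMPTION: one distinct address is served (and broadcast) per
    pass, so the answer is the number of distinct addresses requested.
    """
    if len(addresses) != WARP_SIZE:
        raise ValueError(f"expected {WARP_SIZE} addresses, one per thread")
    return len(set(addresses))


uniform = [0x40] * WARP_SIZE                   # every thread reads the same value
per_thread = [4 * t for t in range(WARP_SIZE)] # every thread reads its own value

print(constant_cache_passes(uniform))
print(constant_cache_passes(per_thread))
```

The uniform case is the one where the constant cache behaves best; per-thread indexing into constant memory is the worst case the question asks about.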

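Shared memory, unlike the constant cache, can serve different addresses to different threads at full speed as long as there are no bank conflicts. The CUDA Optimization material explains bank conflicts and their impact in more detail.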
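To make bank conflicts concrete, the sketch below computes the conflict degree of one warp's shared memory access, meaning the number of passes the access is split into. The layout is an assumption: 32 banks of 4-byte words, with threads that read the same word served by a single broadcast. Older devices use 16 banks, and Kepler can also be configured for 8-byte banks, so adjust the constants for your hardware.

```python
from collections import defaultdict

NUM_BANKS = 32   # ASSUMPTION: 32 banks
BANK_WIDTH = 4   # ASSUMPTION: 4-byte bank words


def bank_conflict_degree(byte_addresses):
    """Return how many passes a warp's shared memory access needs.

    Two threads conflict when they touch *different* words that live
    in the same bank; reads of the same word count only once.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    if not words_per_bank:
        return 0  # no accesses at all
    return max(len(words) for words in words_per_bank.values())


threads = range(32)
print(bank_conflict_degree([4 * t for t in threads]))    # stride of 1 word
print(bank_conflict_degree([8 * t for t in threads]))    # stride of 2 words
print(bank_conflict_degree([128 * t for t in threads]))  # stride of 32 words
```

In this model a stride of one word is conflict-free, a stride of two words halves throughput, and a stride of 32 words sends every thread to the same bank.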
