What happens when all threads in a warp read the same global memory address?

Question

I want to know what happens when all threads of a warp read the same 32-bit global memory address. How many memory requests are issued? Is there any serialization? The GPU is a Fermi card and the programming environment is CUDA 4.0.

Also, can someone explain the concept of bus utilization? What is the difference between caching loads and non-caching loads? I came across both concepts in http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf .

+4

gpu gpgpu cuda gpu-programming

Fan zhang May 24 '12 at 7:42

1 answer

Krazy glew · Accepted Answer · 2012-06-24T05:39:25+0000

All threads in a warp accessing the same address in global memory

I could answer your questions off the top of my head for AMD GPUs. For Nvidia, a web search turned up the answers quickly enough.

I want to know what happens when all threads of a warp read the same 32-bit global memory address. How many memory requests are issued? Is there any serialization? The GPU is a Fermi card and the programming environment is CUDA 4.0.

http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_Best_Practises_For_OpenCL_Programming.pdf from 2009 says:

Coalescing:
Global memory latency: 400-600 cycles. The single most important performance consideration!
Global memory access by the threads of a half-warp can be coalesced into one transaction for words of size 8-bit, 16-bit, 32-bit, or 64-bit, or into two transactions for 128-bit words.
Global memory can be viewed as composed of aligned segments of 16 and 32 words.
Coalescing in Compute Capability 1.0 and 1.1:
The k-th thread in a half-warp must access the k-th word in a segment; however, not all threads need to participate.
Coalescing in Compute Capability 1.2 and 1.3:
Coalescing for any pattern of access that fits into a segment size.

So it sounds like having all threads of a warp access the same 32-bit global memory address will work about as well as could be hoped for on anything with Compute Capability >= 1.2, but not on 1.0 and 1.1.

Your card is fine: Fermi is Compute Capability 2.x.

I must admit that I have not tested this for Nvidia. I tested it for AMD.
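As a rough illustration of the quoted rules, here is a toy Python model that counts transactions for one half-warp. It is only a sketch of the slide text, not a description of real hardware: the function names, the 128-byte segment size for the 1.2/1.3 case, and the assumption that a rule-breaking pattern on 1.0/1.1 costs one transaction per thread are all mine, and it assumes every thread participates. It does not model Fermi itself.

```python
HALF_WARP = 16  # threads per half-warp on Compute Capability 1.x


def transactions_cc10(addresses, word_size=4):
    """Toy model of the CC 1.0/1.1 rule: thread k must read word k of one aligned segment."""
    segment_bytes = HALF_WARP * word_size
    # Start of the aligned segment that contains thread 0's address.
    base = addresses[0] - addresses[0] % segment_bytes
    in_order = all(
        addr == base + k * word_size for k, addr in enumerate(addresses)
    )
    # Assumption: a pattern that breaks the rule is issued as one transaction per thread.
    return 1 if in_order else len(addresses)


def transactions_cc12(addresses, segment_bytes=128):
    """Toy model of the CC 1.2/1.3 rule: one transaction per aligned segment touched."""
    # segment_bytes=128 is an assumed segment size for 32-bit words.
    return len({addr // segment_bytes for addr in addresses})


# Every thread reads the same 32-bit word (the case in the question).
same_address = [0x1000] * HALF_WARP
# Thread k reads word k of an aligned segment (the classic coalesced pattern).
sequential = [0x1000 + 4 * k for k in range(HALF_WARP)]

print("same address, CC 1.0/1.1:", transactions_cc10(same_address))
print("same address, CC 1.2/1.3:", transactions_cc12(same_address))
print("sequential,   CC 1.0/1.1:", transactions_cc10(sequential))
print("sequential,   CC 1.2/1.3:", transactions_cc12(sequential))
```

Under these assumptions the same-address pattern collapses to a single transaction once the "any pattern that fits in a segment" rule applies, while the stricter 1.0/1.1 rule does not coalesce it.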

The difference between caching and non-caching loads

To begin with, look at slide 4 of the presentation you link to, http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf .

That is, the slide titled "Differences between CPUs and GPUs", which says that CPUs have huge caches and GPUs do not.

A few years ago, such a slide might have said that GPUs have no caches at all. However, GPUs have begun adding more and more cache, and/or converting more and more local memory into cache.

I’m not sure if you understand what “cache” is in computer architecture. This is a big topic, so I will give only a short answer.

Basically, a cache is like local memory. Both caches and local memory are closer to the processor or GPU than DRAM, the main memory, whether that is the GPU's private DRAM or the CPU's system memory. Nvidia calls the main DRAM global memory. Slide 9 illustrates this.

Both caches and local memory are closer to the GPU than the global DRAM: on slide 9 they are drawn inside the same chip as the GPU, whereas the DRAM sits on separate chips. This closeness can have several good effects on latency, bandwidth, power, and, yes, bus utilization (which is related to bandwidth).

Latency: global memory is 400-800 cycles away. This means that if you had only one warp in your application, it would execute only one memory operation every 400-800 cycles. So, to avoid slowing down, you need many threads/warps issuing memory requests that can run in parallel, i.e. a high degree of MLP (memory-level parallelism). Fortunately, graphics workloads usually provide this. Caches are closer, so they have lower latency. Your slides do not say what it is, but other sources say 50-200 cycles, which is 4-8X faster than global memory. This means that fewer threads and warps are needed to avoid slowing down.
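The latency argument can be put into numbers with a back-of-the-envelope Python sketch. The model is deliberately crude and is my own assumption, not something taken from the slides: each warp has exactly one memory request outstanding and then waits for the full latency, and the target of one request every 20 cycles is a hypothetical figure chosen only for illustration.

```python
import math


def warps_needed(latency_cycles, cycles_between_requests):
    """Warps required to issue one request every `cycles_between_requests` cycles,
    assuming each warp issues one request and then waits `latency_cycles`."""
    return math.ceil(latency_cycles / cycles_between_requests)


TARGET = 20  # hypothetical target: one memory request issued every 20 cycles

# Latency ranges quoted in the answer, in cycles.
for name, (low, high) in {"global memory": (400, 800), "cache": (50, 200)}.items():
    print(f"{name}: {warps_needed(low, TARGET)} to {warps_needed(high, TARGET)} warps")

# The 4-8X figure is just the ratio of the quoted latency ranges.
print("best case ratio:", 400 / 50, "worst case ratio:", 800 / 200)
```

Whatever target rate is picked, the number of warps needed scales directly with latency, which is why the lower cache latency lets fewer warps keep the memory system busy.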

Throughput/bandwidth: local memory and/or cache typically have more bandwidth than the global DRAM. Your slides say 1+ TB/s versus 177 GB/s, i.e. cache and local memory are more than 5X faster. This higher bandwidth can translate into significantly higher frame rates.

Power: you save a lot of power by going to cache or local memory rather than to global DRAM. That may not matter for a desktop gaming PC, but it matters for a laptop or tablet. Actually, it matters even for a desktop gaming PC, because less power means it can be clocked (or overclocked) faster.

OK, so local memory and cache are similar in the respects above. What is the difference?

Basically, it is easier to program for caches than for local memory. Very good, expert ninja programmers are needed to manage local memory properly, copying data in from global memory as needed and flushing it back out. A cache is much easier to manage, because you just do a caching load and the data is automatically placed in the cache, where it can be accessed faster next time.

But caches also have disadvantages.

First, they actually burn a bit more power than local memory, or they would if there were actually separate local and cache memories. However, in Fermi, local memory (which CUDA calls shared memory) can be configured as cache, and vice versa. (For years, GPU people said, "We don't need any stinking caches; cache tags and the other overhead are just wasteful.")

More importantly, caches tend to operate on cache lines, but not all programs do. This leads to the bus utilization problem you mention. If a warp accesses all the words in a cache line, great. But if a warp accesses only one word in a cache line, i.e. one 4-byte word, and then skips 124 bytes, then 128 bytes of data are transferred over the bus but only 4 bytes are used. That is, more than 96% of the bus bandwidth is wasted. This is low bus utilization.

Whereas the very next slide shows that a non-caching load, which one might use to load data into local memory, transfers only 32 bytes, so "only" 28 bytes out of 32 are wasted. In other words, non-caching loads can be 4X more efficient, 4X faster, than caching loads.
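The bus utilization figures above are simple ratios, sketched here in Python. The helper name `bus_utilization` is made up for illustration; the 128-byte and 32-byte transfer sizes are the ones from the slides discussed above.

```python
def bus_utilization(bytes_used, bytes_transferred):
    """Fraction of the bytes moved across the bus that the warp actually uses."""
    return bytes_used / bytes_transferred


WORD = 4          # one 32-bit word actually needed
CACHE_LINE = 128  # bytes moved by a caching load
SEGMENT = 32      # bytes moved by a non-caching load

caching = bus_utilization(WORD, CACHE_LINE)
non_caching = bus_utilization(WORD, SEGMENT)

print(f"caching load:     {caching:.3%} used, {1 - caching:.3%} wasted")
print(f"non-caching load: {non_caching:.3%} used, {1 - non_caching:.3%} wasted")
print(f"non-caching advantage: {non_caching / caching:.0f}X")
```

The same arithmetic should apply to the case in the question: if every thread in the warp reads the same 4-byte word, only 4 distinct bytes are useful no matter how many threads consume them.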

Then why not use non-caching loads everywhere? Because they are harder to program: they require expert ninja programmers. And caches work pretty well much of the time.

So, instead of paying your expert ninja programmers to spend a lot of time optimizing all of the code to use non-caching loads and manually managed local memory, you do the easy stuff with caching loads, and you let the highly paid ninja programmers concentrate on the cases where the cache does not work well.

Besides: nobody likes to admit it, but often the cache does better than the ninja programmers.

Hope this helps. Utilization of bandwidth, power and bus: in addition to reducing
