The CUDA Programming Guide states that
"Bandwidth is one of the most important factors affecting performance. Almost all code changes should be made in the context of how they affect bandwidth."
The guide then calculates the theoretical bandwidth, which comes out to hundreds of gigabytes per second. What I don't understand is why the number of bytes read from / written to global memory should reflect how well a kernel is optimized.
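For context, the effective bandwidth the guide compares against the theoretical number is just total bytes moved through global memory divided by kernel time. A minimal sketch of that calculation (the array size and timing here are made-up illustration values, not from the guide):

```python
# Effective bandwidth as defined in the CUDA Best Practices Guide:
#   ((bytes_read + bytes_written) / 1e9) / seconds  ->  GB/s
def effective_bandwidth_gbps(bytes_read, bytes_written, seconds):
    return (bytes_read + bytes_written) / 1e9 / seconds

# Hypothetical example: a kernel reads and writes a 32M-element
# float array (4 bytes per element) and takes 2 ms.
n = 32 * 1024 * 1024
bytes_per_direction = n * 4
print(effective_bandwidth_gbps(bytes_per_direction, bytes_per_direction, 2e-3))
# -> 134.217728 (GB/s)
```

So the metric only counts global-memory traffic, which is exactly why a compute-heavy kernel working out of shared memory and registers scores low on it.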
If I have a kernel that does intensive computation on data held in shared memory and/or registers, with only a single read from global memory at the start and a single write back at the end, its effective bandwidth will of course be small, while the kernel itself can be very efficient.
Can anyone explain what bandwidth means in this context?

Thanks!