I wrote my own CUDA kernel. Compared with the equivalent CPU code, my kernel is about 10 times faster.
But I have some questions about my experiments.
Is my program fully optimized? That is, does it use all GPU cores, use shared memory properly, use an adequate number of registers, and reach enough occupancy?
How can I evaluate the performance of my kernel code?
How can I calculate the theoretical maximum bandwidth of a CUDA device?
Is it right to compare the CPU's GFLOPS with the GPU's GFLOPS, and does the GFLOPS ratio transparently reflect their theoretical performance?
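
To make the last two questions concrete, here is a minimal host-side sketch of the usual back-of-the-envelope formulas for theoretical peak bandwidth, effective bandwidth, and peak GFLOPS. The function names are hypothetical, and the defaults (two transfers per memory clock for double-data-rate memory, two floating-point operations per core per clock for a fused multiply-add) are assumptions that vary by device. The inputs would typically come from the vendor spec sheet or from device properties such as `memoryClockRate` and `memoryBusWidth` in `cudaDeviceProp`.

```cpp
#include <cassert>

// Theoretical peak memory bandwidth in GB/s.
// memory_clock_khz:    memory clock in kHz
// bus_width_bits:      memory interface width in bits
// transfers_per_clock: 2 for double-data-rate memory (assumption)
double peak_bandwidth_gbs(double memory_clock_khz, int bus_width_bits,
                          int transfers_per_clock = 2) {
    assert(memory_clock_khz > 0 && bus_width_bits > 0 && transfers_per_clock > 0);
    double bytes_per_transfer = bus_width_bits / 8.0;
    return memory_clock_khz * 1e3 * bytes_per_transfer * transfers_per_clock / 1e9;
}

// Effective bandwidth in GB/s actually achieved by a kernel:
// total bytes read and written, divided by the measured kernel time.
double effective_bandwidth_gbs(double bytes_read, double bytes_written,
                               double seconds) {
    assert(seconds > 0);
    return (bytes_read + bytes_written) / 1e9 / seconds;
}

// Theoretical peak arithmetic rate in GFLOPS.
// flops_per_core_per_clock: 2 if each core retires one fused multiply-add
// per clock (assumption; differs for double precision and by architecture).
double peak_gflops(int cores, double core_clock_ghz,
                   int flops_per_core_per_clock = 2) {
    assert(cores > 0 && core_clock_ghz > 0 && flops_per_core_per_clock > 0);
    return cores * core_clock_ghz * flops_per_core_per_clock;
}
```

For example, a device with an 877 MHz memory clock and a 4096-bit bus gives `peak_bandwidth_gbs(877000.0, 4096)` of roughly 898 GB/s. Comparing a kernel's effective bandwidth (or achieved GFLOPS) against these peaks is one rough way to judge how close it is to the hardware limit, assuming the kernel time is measured separately.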
Thanks in advance.
bongmo.kim