How do I evaluate the performance of CUDA kernel code?

I wrote my own CUDA kernel. Compared with the equivalent CPU code, my kernel is 10 times faster.

But I have some questions about my experiments.

Is my program well optimized: does it use all the GPU cores, use shared memory correctly, have enough registers, and keep the GPU busy enough?

How can I evaluate the performance of my kernel code?

How can I calculate the theoretical maximum bandwidth of a CUDA device?

Is it fair to compare theoretical performance by comparing CPU GFLOPS with GPU GFLOPS?

Thanks in advance.

2 answers

Is my program well optimized: does it use all the GPU cores, use shared memory correctly, have enough registers, and keep the GPU busy enough?

To find out, use one of the CUDA profilers. See How to profile and optimize CUDA kernels?
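Beyond the profilers, you can also query some of this directly from the runtime API. Here is a minimal sketch (myKernel and the block size are placeholders for your own code) that reports a kernel's register and shared memory usage via cudaFuncGetAttributes, and its theoretical occupancy via cudaOccupancyMaxActiveBlocksPerMultiprocessor (available in newer CUDA versions):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel standing in for your own.
    __global__ void myKernel(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    int main()
    {
        // Per-kernel resource usage: registers per thread,
        // static shared memory per block.
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, myKernel);
        printf("registers/thread: %d, static shared mem/block: %zu bytes\n",
               attr.numRegs, attr.sharedSizeBytes);

        // Theoretical occupancy for a chosen block size.
        int blockSize = 256;
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                      blockSize, 0 /* dynamic smem */);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        double occupancy = (double)(blocksPerSM * blockSize)
                         / prop.maxThreadsPerMultiProcessor;
        printf("theoretical occupancy at block size %d: %.0f%%\n",
               blockSize, occupancy * 100.0);
        return 0;
    }

High theoretical occupancy does not guarantee performance, but very low occupancy is often a sign that register or shared memory usage is limiting how busy the GPU can be.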

How can I calculate the theoretical maximum bandwidth of a CUDA device?

This math is a bit involved, differs for each architecture, and is easy to get wrong. It is better to look up the numbers in the specifications for your chip. Wikipedia has tables, such as this one for the GTX 500 series cards. For example, the table shows that the GTX 580 has a theoretical peak memory bandwidth of 192.4 GB/s and a theoretical peak compute throughput of 1581.1 GFLOPS.
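If you do want to derive the bandwidth figure yourself, one common approach (the same formula NVIDIA uses in its performance-metrics examples) is to compute it from the memory clock and bus width reported in cudaDeviceProp. A minimal sketch, assuming double-data-rate memory (hence the factor of 2):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // memoryClockRate is in kHz, memoryBusWidth is in bits.
        // The factor of 2 assumes double-data-rate memory (DDR/GDDR).
        double peakGBs = 2.0 * prop.memoryClockRate
                       * (prop.memoryBusWidth / 8) / 1.0e6;
        printf("%s: theoretical peak memory bandwidth %.1f GB/s\n",
               prop.name, peakGBs);
        return 0;
    }

For the GTX 580 this reproduces the 192.4 GB/s above (2 × 2004 MHz × 384 bits / 8). The compute figure decomposes similarly: 512 CUDA cores × 1544 MHz shader clock × 2 FLOPs per fused multiply-add ≈ 1581.1 single-precision GFLOPS.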

Is it fair to compare theoretical performance by comparing CPU GFLOPS with GPU GFLOPS?

If I understand correctly, you are asking whether the theoretical peak GFLOPS number of a GPU can be meaningfully compared with the corresponding number for a CPU. There are a few things to consider when comparing these numbers:

  • Older GPUs do not support double precision (DP) floating point, only single precision (SP).

  • GPUs that do support DP do so with a significant performance penalty compared to SP. The GFLOPS number I quoted above was for SP. On the other hand, the numbers quoted for CPUs are often for DP, and the gap between SP and DP performance on a CPU is smaller.

  • CPU quotes can be for rates that are achievable only with vectorized SIMD (single instruction, multiple data) instructions, and it is generally very hard to write algorithms that come close to that theoretical maximum (they may have to be written in assembly). Sometimes CPU quotes combine all the computing resources available through different types of instructions, and it is often nearly impossible to write a program that uses them all simultaneously (see the worked example after this list).

  • The rates quoted for GPUs assume that you have enough parallel work to saturate the GPU and that your algorithm is not bandwidth-bound.
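To make the CPU side concrete, here is a back-of-the-envelope example with made-up but typical numbers: a hypothetical quad-core CPU at 3.4 GHz whose SIMD units are 8 floats wide and can issue one multiply and one add per cycle peaks at

    4 cores × 3.4 GHz × 8 SIMD lanes × 2 FLOPs/cycle = 217.6 SP GFLOPS

and that figure is only reachable with fully vectorized code running on all four cores at once.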


The preferred performance metric is elapsed time. GFLOPS can be used for comparison, but it is often difficult to compare across compilers and architectures because of differences in instruction sets, compiler code generation, and the way FLOPs are counted.

The best approach is to measure application runtime. For CUDA code, you should time all of the work that occurs on each run, including memory copies and synchronization.
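On the GPU side, CUDA events are the usual way to take such measurements, since they timestamp work as it passes through the stream. A minimal sketch (the kernel, sizes, and names are placeholders for your own code) that times the host-to-device copy, the kernel, and the device-to-host copy as one unit:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Placeholder kernel; substitute your own.
    __global__ void myKernel(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *hIn  = (float *)malloc(bytes);   // input left uninitialized;
        float *hOut = (float *)malloc(bytes);   // we only care about timing here
        float *dIn, *dOut;
        cudaMalloc(&dIn, bytes);
        cudaMalloc(&dOut, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time everything a real run pays for: copy in, kernel, copy out.
        cudaEventRecord(start);
        cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
        myKernel<<<(n + 255) / 256, 256>>>(dOut, dIn, n);
        cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);   // block until all recorded work finishes

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("end-to-end time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(dIn);
        cudaFree(dOut);
        free(hIn);
        free(hOut);
        return 0;
    }

For total wall-clock time including host-side work, a host timer around the whole run with a final cudaDeviceSynchronize() works as well.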

Nsight Visual Studio Edition and the Visual Profiler provide the most accurate measurement of each operation. Nsight Visual Studio Edition reports the theoretical bandwidth and FLOPS values for each device. In addition, the Achieved FLOPs experiment can be used to capture the FLOP count for both single and double precision.

