Time difference reported by NVVP and counters

I am running a CUDA kernel program and I observe a significant difference between the kernel execution time reported by the GPU and the time reported by the NVVP counters. Why is such a difference usually observed?

1 answer

Nsight Visual Studio Edition and the Visual Profiler support two mechanisms for capturing kernel duration. Both of these mechanisms produce a smaller and more accurate value than what is reported by CUevent / cudaEvent timing. The two mechanisms, along with CUDA event timing for comparison, are described below:

  • Concurrent kernel timing

    This is the default mode used by Nsight 2.x and Visual Profiler 5.0 to generate the timeline. Kernel duration is defined as the time from when the kernel code starts executing on the device to when it completes. This duration cannot be measured using CUDA events.

  • Serialized kernel timing

    This is the default mode used by the tools when collecting PM counters per kernel. Kernel duration is defined as the time from when the GPU processes the launch request to when the GPU is idle after the kernel completes. This mode specifically disables concurrent kernel execution. In almost all cases the reported duration will be slightly longer than the concurrent kernel trace duration, since it includes the time for the GPU to launch the first block and the time for the GPU to complete all memory stores.

  • CUDA event range timing

    CUDA event timing is performed by calling cu / cudaEventRecord before and after the kernel launch in the same stream. Each event record inserts a command into the GPU push buffer; when the command reaches the GPU, it writes a timestamp to memory. It is possible to issue two event records without a launch in between, which lets the developer measure the GPU time between the two timestamp commands (a minimal sketch follows this list). This method has the following disadvantages, which is why I encourage developers to use the tools (Nsight, Visual Profiler, and CUPTI):

    a. The time between submitting the start event record and the kernel launch can be affected by CPU overhead. Launch overhead is 5-8 µs on Linux / TCC and potentially much higher on WDDM.

    b. The GPU can context switch between the start event record and the execution of the kernel.

    c. The start event record will include launch overhead, including the time to update driver buffers that need to be resized, copy parameters, copy texture bindings, ...

    d. The time between the end of the kernel and the recording of the end event can affect the measurement.

    e. The GPU can context switch between the end of kernel execution and the recording of the end event.

    f. Incorrect use of events will break concurrent kernel execution.
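
For reference, here is a minimal sketch of the CUDA event range timing described above. The kernel, sizes, and use of the default stream are illustrative assumptions, not taken from the question; the measured value is subject to the overheads listed in points a-f.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical), just to have something to time.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events in the same stream (here the default stream) before and
    // after the launch; each record writes a GPU timestamp when it is reached.
    cudaEventRecord(start, 0);
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds
    printf("event-range kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```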

Each of these modes will report a different duration value. In addition, the definition of duration used by the tools differs from the one available through events.

The NVIDIA tools define kernel duration with the maximum possible precision: the time from when the GPU starts working on the kernel until the GPU finishes working on the kernel. If a developer is interested in collecting this information, they should look at the CUPTI SDK included with the CUDA Toolkit.
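
As a rough illustration of the CUPTI route, the sketch below uses the CUPTI Activity API to collect per-kernel start/end timestamps. The record struct version (CUpti_ActivityKernel4 here), buffer size, and the helper function name are assumptions that vary across toolkit versions, so consult the CUPTI samples shipped with your CUDA Toolkit for the exact form.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cupti.h>

// CUPTI asks for a buffer to fill with activity records.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
    *size = 16 * 1024;
    *buffer = (uint8_t *)malloc(*size);
    *maxNumRecords = 0;   // let CUPTI fill as many records as fit
}

// CUPTI hands back a completed buffer; walk the records and print durations.
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize)
{
    CUpti_Activity *record = NULL;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL) {
            // Struct version is an assumption; newer toolkits use later versions.
            CUpti_ActivityKernel4 *k = (CUpti_ActivityKernel4 *)record;
            // start/end are GPU timestamps in nanoseconds.
            printf("%s: %llu ns\n", k->name,
                   (unsigned long long)(k->end - k->start));
        }
    }
    free(buffer);
}

// Hypothetical setup helper: call once before launching kernels.
void initKernelTiming()
{
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL);
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    // ... launch kernels, then flush before exiting:
    // cuptiActivityFlushAll(0);
}
```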
