CUDA Kernel Timing Strategies: Pros and Cons?

When timing CUDA kernels, the following does not work, because a kernel launch is asynchronous: it does not block execution of the CPU program while the kernel runs:

    start timer
    kernel<<<g,b>>>();
    end timer
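For example, with the read_timer() helper defined at the end of this question (a minimal sketch; kernel, g, and b stand in for your own kernel and launch configuration), the naive version measures only the launch overhead:

    float start = read_timer();   // CPU-side timestamp, milliseconds
    kernel<<<g,b>>>();            // returns as soon as the launch is queued
    float launchOverhead = read_timer() - start;
    // launchOverhead covers only the microseconds needed to queue the
    // launch, not the kernel's actual execution time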

I have seen three main ways of (successfully) timing CUDA kernels:

(1) Two CUDA event records.

    float responseTime; // result will be in milliseconds
    cudaEvent_t start;
    cudaEventCreate(&start);
    cudaEventRecord(start);
    cudaEventSynchronize(start);
    cudaEvent_t stop;
    cudaEventCreate(&stop);
    kernel<<<g,b>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&responseTime, start, stop); // responseTime = elapsed time

(2) One CUDA event record.

    float start = read_timer(); // helper function on CPU, in milliseconds
    cudaEvent_t stop;
    cudaEventCreate(&stop);
    kernel<<<g,b>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float responseTime = read_timer() - start;

(3) cudaDeviceSynchronize instead of an event record. (Perhaps this is only useful when programming in a single stream.)

    float start = read_timer(); // helper function on CPU, in milliseconds
    kernel<<<g,b>>>();
    cudaDeviceSynchronize();
    float responseTime = read_timer() - start;

I experimentally confirmed that these three strategies produce the same timing result.


Questions:

  • What are the tradeoffs of these strategies? Are there any hidden details here?
  • Besides timing many kernels across multiple streams, are there any advantages to using two event records and the cudaEventElapsedTime() function?

You can probably use your imagination to figure out what read_timer() does. Nevertheless, it cannot hurt to provide an example implementation:

    double read_timer(){
        struct timeval start;
        gettimeofday( &start, NULL ); // requires #include <sys/time.h>
        return (double)((start.tv_sec) + 1.0e-6 * (start.tv_usec)) * 1000; // milliseconds
    }
2 answers

You seem to have ruled out most of the differences yourself, by saying that they all give the same result for the relatively simple case you showed (probably not exactly true, but I understand what you mean), and that beyond timing complex sequences, the first case is clearly better.

One possible difference is portability between Windows and Linux. Your read_timer example function is Linux-oriented: gettimeofday is POSIX-only. You could write a portable read_timer function, but the CUDA event system (method 1) is portable as it is.
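A minimal sketch of such a portable replacement, assuming a C++11 host compiler (it reuses the read_timer name and millisecond convention from the question):

    #include <chrono>

    // Portable CPU-side timer: milliseconds since an arbitrary epoch.
    // std::chrono::steady_clock is monotonic and available on both
    // Windows and Linux, unlike the POSIX-only gettimeofday().
    double read_timer(){
        using namespace std::chrono;
        return duration<double, std::milli>(
            steady_clock::now().time_since_epoch()).count();
    }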


Option (1) as written uses cudaEventRecord to measure against CPU time. This is very inefficient, and I would discourage using cudaEventRecord for that purpose. cudaEventRecord can instead be used to time, on the GPU, the interval from the kernel launch being queued to the kernel completing, as follows:

    float responseTime; // result will be in milliseconds
    cudaEvent_t start;
    cudaEvent_t stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    kernel<<<g,b>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&responseTime, start, stop); // responseTime = elapsed time

The code needs to be modified slightly if you submit multiple pieces of work to multiple streams, as sketched below. I would also recommend reading the answer to Time difference reported by NVVP and counters.
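As a rough sketch of that modification, assuming the same kernel, g, and b as above (the stream handle s is hypothetical): cudaEventRecord takes the stream as an optional second argument, so the events bracket only the work submitted to that stream.

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, s);       // recorded into stream s
    kernel<<<g, b, 0, s>>>();        // launched into stream s
    cudaEventRecord(stop, s);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop); // time for stream s's work only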

Options (2) and (3) are similar for this example. Option (2) may be more flexible.

