When timing CUDA kernels, the following does not work, because the kernel launch does not block the CPU program while the kernel executes:
    start timer;
    kernel<<<g,b>>>();
    end timer;
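To illustrate what the naive pattern actually measures, here is a minimal self-contained sketch (my own illustration, not part of the original setup; the spin kernel, the cycle count, and the use of std::chrono in place of a timer helper are all invented for the example). The host timer stopped right after the launch captures only the launch overhead; stopping it again after cudaDeviceSynchronize() includes the kernel's execution time.

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    // Dummy kernel that busy-waits for roughly the requested number of clock cycles.
    __global__ void spin(long long cycles) {
        long long t0 = clock64();
        while (clock64() - t0 < cycles) { }
    }

    int main() {
        using clk = std::chrono::steady_clock;
        cudaFree(0);                       // force context creation before timing

        auto t0 = clk::now();
        spin<<<1, 1>>>(100000000LL);       // launch returns to the CPU immediately
        double launchMs = std::chrono::duration<double, std::milli>(clk::now() - t0).count();

        cudaDeviceSynchronize();           // now actually wait for the kernel to finish
        double totalMs = std::chrono::duration<double, std::milli>(clk::now() - t0).count();

        printf("timer stopped after launch: %.3f ms\n", launchMs); // launch overhead only
        printf("timer stopped after sync:   %.3f ms\n", totalMs);  // includes kernel execution
        return 0;
    }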
I have seen three main ways of (successfully) timing CUDA kernels:
(1) Two CUDA event records.
    float responseTime; // result will be in milliseconds
    cudaEvent_t start;
    cudaEventCreate(&start);
    cudaEventRecord(start);
    cudaEventSynchronize(start);
    cudaEvent_t stop;
    cudaEventCreate(&stop);
    kernel<<<g,b>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&responseTime, start, stop); // responseTime = elapsed time
(2) One CUDA event record.
    float start = read_timer(); // helper function on CPU, in milliseconds
    cudaEvent_t stop;
    cudaEventCreate(&stop);
    kernel<<<g,b>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float responseTime = read_timer() - start;
(3) cudaDeviceSynchronize() instead of an event record. (Perhaps this is only useful when working with a single stream; see the sketch after the snippet below.)
    float start = read_timer(); // helper function on CPU, in milliseconds
    kernel<<<g,b>>>();
    cudaDeviceSynchronize();
    float responseTime = read_timer() - start;
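Regarding the single-stream caveat in (3), here is a sketch of why cudaDeviceSynchronize() is awkward when several streams are in flight: it waits for all outstanding work on the device, so a host timer around it measures the longest-running stream, whereas a pair of events recorded into one stream brackets only that stream's kernel. (Again my own illustration, not code from the original setup; the spin kernel, the stream names, and the cycle counts are invented, and std::chrono stands in for read_timer().)

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void spin(long long cycles) {
        long long t0 = clock64();
        while (clock64() - t0 < cycles) { }
    }

    int main() {
        cudaFree(0);                                   // force context creation up front

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        auto h0 = std::chrono::steady_clock::now();

        cudaEventRecord(start, s1);
        spin<<<1, 1, 0, s1>>>(1000000LL);              // short kernel in stream 1
        cudaEventRecord(stop, s1);
        spin<<<1, 1, 0, s2>>>(200000000LL);            // long kernel in stream 2

        cudaDeviceSynchronize();                       // waits for BOTH streams
        double hostMs = std::chrono::duration<double, std::milli>(
                            std::chrono::steady_clock::now() - h0).count();

        float eventMs;
        cudaEventElapsedTime(&eventMs, start, stop);   // brackets only the stream-1 kernel

        printf("events on stream 1:                 %.3f ms\n", eventMs);
        printf("host timer + cudaDeviceSynchronize: %.3f ms\n", hostMs);
        return 0;
    }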
I have experimentally confirmed that these three strategies produce the same timing result.
Questions:
- What are the tradeoffs of these strategies? Are there any hidden details here?
- Besides timing many kernels in multiple streams, are there any advantages to using two event records and the cudaEventElapsedTime() function?
You can probably use your imagination to figure out what read_timer() does. Nevertheless, it cannot hurt to give an example implementation:
    #include <sys/time.h>

    double read_timer(){
        struct timeval start;
        gettimeofday( &start, NULL );
        return start.tv_sec * 1000.0 + start.tv_usec / 1000.0; // milliseconds
    }