Timing a CUDA Application Using Events

I use the following two functions to time different parts of my code (cudaMemcpyHtoD, kernel execution, cudaMemcpyDtoH); the code includes multiple GPUs, concurrent kernels on one GPU, sequential execution of kernels, and so on. As far as I understand, these functions record the time elapsed between two events, but I assume that scattering event records throughout the code can introduce overhead and inaccuracies. I would like to hear criticism, general recommendations for improving these functions, and any warnings about using them.

    // Create events and start recording
    cudaEvent_t *start_event(int device, cudaEvent_t *events, cudaStream_t streamid = 0)
    {
        cutilSafeCall( cudaSetDevice(device) );
        cutilSafeCall( cudaEventCreate(&events[0]) );
        cutilSafeCall( cudaEventCreate(&events[1]) );
        cudaEventRecord(events[0], streamid);
        return events;
    }

    // Return elapsed time and destroy events
    float end_event(int device, cudaEvent_t *events, cudaStream_t streamid = 0)
    {
        float elapsed = 0.0f;
        cutilSafeCall( cudaSetDevice(device) );
        cutilSafeCall( cudaEventRecord(events[1], streamid) );
        cutilSafeCall( cudaEventSynchronize(events[1]) );
        cutilSafeCall( cudaEventElapsedTime(&elapsed, events[0], events[1]) );
        cutilSafeCall( cudaEventDestroy(events[0]) );
        cutilSafeCall( cudaEventDestroy(events[1]) );
        return elapsed;
    }

Usage:

    cudaEvent_t *events;
    cudaEvent_t event[2];   // 0 for start, 1 for end
    ...
    events = start_event(cuda_device, event, 0);
    <Code to time>
    printf("Time taken for the above code... - %f secs\n\n",
           end_event(cuda_device, events, 0) / 1000);
1 answer

First, if this is for production code, you might want to do some useful CPU work between the second cudaEventRecord and the cudaEventSynchronize(). Otherwise the synchronization may reduce your application's ability to overlap GPU and CPU work.
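A minimal sketch of that pattern (busyKernel and doIndependentCpuWork are placeholders I made up, not part of the question's code):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void busyKernel(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = d[i] * 2.0f + 1.0f;
    }

    // Placeholder for host work that does not depend on the GPU result.
    static void doIndependentCpuWork() { /* ... */ }

    int main() {
        const int n = 1 << 20;
        float *d = 0;
        cudaMalloc(&d, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        busyKernel<<<(n + 255) / 256, 256>>>(d, n);  // asynchronous with respect to the host
        cudaEventRecord(stop, 0);

        doIndependentCpuWork();        // CPU keeps working while the GPU runs

        cudaEventSynchronize(stop);    // block only once there is nothing left to overlap
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Kernel time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        return 0;
    }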

Next, I would separate the creation and destruction of the events from the recording of them. I'm not sure of the cost, but in general you may not want to call cudaEventCreate and cudaEventDestroy frequently.

What I would do is create a class like this:

    #include <cassert>
    #include <cuda_runtime.h>

    class EventTimer {
    public:
        EventTimer() : mStarted(false), mStopped(false) {
            cudaEventCreate(&mStart);
            cudaEventCreate(&mStop);
        }
        ~EventTimer() {
            cudaEventDestroy(mStart);
            cudaEventDestroy(mStop);
        }

        // Record the start event on stream s.
        void start(cudaStream_t s = 0) {
            cudaEventRecord(mStart, s);
            mStarted = true;
            mStopped = false;
        }

        // Record the stop event on stream s.
        void stop(cudaStream_t s = 0) {
            assert(mStarted);
            cudaEventRecord(mStop, s);
            mStarted = false;
            mStopped = true;
        }

        // Synchronize on the stop event and return the elapsed time in milliseconds.
        float elapsed() {
            assert(mStopped);
            if (!mStopped) return 0;
            cudaEventSynchronize(mStop);
            float elapsed = 0;
            cudaEventElapsedTime(&elapsed, mStart, mStop);
            return elapsed;
        }

    private:
        bool mStarted, mStopped;
        cudaEvent_t mStart, mStop;
    };
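A usage sketch, assuming an illustrative kernel launch (scaleKernel, dData, n, blocks, threads, and stream are not from the original code):

    EventTimer timer;

    timer.start(stream);                                    // record start on the measured stream
    scaleKernel<<<blocks, threads, 0, stream>>>(dData, n);  // illustrative kernel launch
    timer.stop(stream);                                     // record stop on the same stream

    // elapsed() synchronizes on the stop event and returns milliseconds,
    // the unit used by cudaEventElapsedTime.
    printf("Kernel time: %f ms\n", timer.elapsed());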

Note that I did not include cudaSetDevice(). It seems to me this should be left to the code that uses the class, to keep the class more flexible. The caller must guarantee that the same device is active when start() and stop() are called.
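For the multi-GPU case in the question, a sketch of what that contract might look like on the caller's side (device numbers are illustrative; each timer's events belong to the device that was current when the timer was constructed):

    cudaSetDevice(0);
    EventTimer timer0;       // events created on device 0
    timer0.start();
    // ... launch work on device 0 ...
    timer0.stop();

    cudaSetDevice(1);
    EventTimer timer1;       // events created on device 1
    timer1.start();
    // ... launch work on device 1 ...
    timer1.stop();

    cudaSetDevice(0);
    float ms0 = timer0.elapsed();   // make device 0 current again before querying
    cudaSetDevice(1);
    float ms1 = timer1.elapsed();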

PS: Note that CUTIL is not meant by NVIDIA to be used for production code. It is provided just for convenience in our examples, and is not as rigorously tested or optimized as the CUDA libraries and compilers themselves. I recommend you extract things like cutilSafeCall() into your own library and header.
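A minimal stand-in for cutilSafeCall() could look like this (the name CUDA_SAFE_CALL is arbitrary; the sketch just checks the runtime return code and aborts with file/line information on failure):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Check the return code of a CUDA runtime call; on failure, print
    // the error string with file/line context and exit.
    #define CUDA_SAFE_CALL(call)                                        \
        do {                                                            \
            cudaError_t err = (call);                                   \
            if (err != cudaSuccess) {                                   \
                fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                        cudaGetErrorString(err), __FILE__, __LINE__);   \
                exit(EXIT_FAILURE);                                     \
            }                                                           \
        } while (0)

    // Usage: CUDA_SAFE_CALL( cudaSetDevice(device) );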
