Standard functions such as time often have very low resolution (time itself only counts whole seconds). And yes, a good way to get around this is to run your test many times and take the average. Note that the first few runs can be much slower than the rest because of hidden startup costs, especially when using complex resources such as GPUs.
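As a rough sketch of the repeat-and-average idea, something like the following could work. The names work, mean_seconds and sink are made up for illustration, and the workload is a placeholder for whatever you actually want to measure. It uses the standard C clock function, which on POSIX systems reports processor time rather than wall-clock time, so treat it as an assumption that this is the quantity you care about.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical workload, used only for illustration. */
static volatile double sink;

static void work(void)
{
    double acc = 0.0;
    for (int i = 1; i <= 100000; ++i)
        acc += 1.0 / (double)i;
    sink = acc; /* store to a volatile so the loop is not optimized away */
}

/* Call fn `warmup` times untimed to absorb startup costs, then time
 * `runs` calls as a single batch and return the mean seconds per call. */
static double mean_seconds(void (*fn)(void), int warmup, int runs)
{
    assert(runs > 0);

    for (int i = 0; i < warmup; ++i)
        fn();

    clock_t start = clock();
    for (int i = 0; i < runs; ++i)
        fn();
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / runs;
}
```

A call such as mean_seconds(work, 3, 1000) discards three warm-up runs and averages over the next thousand. Timing the whole batch at once, rather than each call separately, spreads the timer's coarse granularity over all the runs.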
For platform-specific calls, look at QueryPerformanceCounter on Windows and CFAbsoluteTimeGetCurrent on OS X. (I haven't used the POSIX call clock_gettime, but it might be worth checking out.)
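For the POSIX route, a minimal sketch might look like this. It is not something from the answer above, so take it as an assumption: it picks CLOCK_MONOTONIC, and the helper names now_seconds and clock_resolution_seconds are invented for the example. The second helper uses clock_getres to ask the system what resolution it reports for that clock.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* expose clock_gettime under strict -std modes */
#endif

#include <assert.h>
#include <time.h>

/* Seconds read from a monotonic clock. The starting point is arbitrary,
 * so only the difference between two readings is meaningful. */
static double now_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* silence unused warning when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Resolution the system reports for the same clock, in seconds. */
static double clock_resolution_seconds(void)
{
    struct timespec res;
    int rc = clock_getres(CLOCK_MONOTONIC, &res);
    assert(rc == 0);
    (void)rc;
    return (double)res.tv_sec + (double)res.tv_nsec * 1e-9;
}
```

To time something, read now_seconds() before and after the code of interest and subtract. On older glibc versions you may also need to link with -lrt.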
Measuring GPU performance is tricky because GPUs are remote processing units that run their own instructions, often on many parallel units at once. You might want to visit Nvidia's CUDA Zone for a variety of resources and tools to help you measure and optimize CUDA code. (Resources related to OpenCL are also highly relevant.)
Ultimately, you want to see how quickly your results make it to the screen, right? For that reason, a call to time may well be enough for your needs.