On SO there are quite a few questions about performance profiling, but I can't seem to find the whole picture. There are many issues involved, and most Q&As ignore all but a few of them, or don't justify their proposals.
What I'm wondering about: if I have two functions that do the same thing and I'm curious about the difference in speed, does it make sense to test this without external tools, using timers, or will this compiled-in testing affect the results too much?
I ask because, if it is sensible, as a C++ programmer I want to know how best to do it, since timers are much simpler than external tools. If it makes sense, let's proceed with all the possible pitfalls:
Consider this example. The following code shows two ways to do the same thing:

    #include <algorithm>
    #include <cstddef>
    #include <ctime>
    #include <iostream>

    typedef unsigned char byte;

    // Reverses n bytes in place using the XOR swap trick.
    // The parameter must be byte*, not void*: a void* cannot be indexed.
    inline void swapBytes( byte* in, size_t n )
    {
        for( size_t lo=0, hi=n-1; hi>lo; ++lo, --hi )
            in[lo] ^= in[hi]
            , in[hi] ^= in[lo]
            , in[lo] ^= in[hi] ;
    }

    int main()
    {
        byte arr[9] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
        const int iterations = 100000000;

        clock_t begin = clock();
        for( int i=iterations; i!=0; --i )
            swapBytes( arr, 8 );

        clock_t middle = clock();
        for( int i=iterations; i!=0; --i )
            std::reverse( arr, arr+8 );

        clock_t end = clock();

        double secSwap = (double) ( middle-begin ) / CLOCKS_PER_SEC;
        double secReve = (double) ( end-middle ) / CLOCKS_PER_SEC;

        std::cout << "swapBytes, for: " << iterations << " times takes: " << middle-begin
                  << " clock ticks, which is: " << secSwap << "sec." << std::endl;
        std::cout << "std::reverse, for: " << iterations << " times takes: " << end-middle
                  << " clock ticks, which is: " << secReve << "sec." << std::endl;

        std::cin.get();
        return 0;
    }

    // Output:
    // Release:
    //  swapBytes,    for: 100000000 times takes: 3000 clock ticks, which is: 3sec.
    //  std::reverse, for: 100000000 times takes: 1437 clock ticks, which is: 1.437sec.
    // Debug:
    //  swapBytes,    for: 10000000 times takes: 1781 clock ticks, which is: 1.781sec.
    //  std::reverse, for: 10000000 times takes: 12781 clock ticks, which is: 12.781sec.

Questions:
- Which timers should be used, and how do we get the CPU time actually consumed by the code in question?
- What are the effects of compiler optimization (since these functions just swap bytes back and forth, the most efficient thing is obviously to do nothing at all)?
- Considering the results presented here, do you think they are accurate (I can assure you that multiple runs give very similar results)? If so, can you explain how std::reverse gets to be so fast, given the simplicity of the custom function? I don't have the source code from the VC++ version that I used for this test, but here is the implementation from GNU. It boils down to the function iter_swap, which is completely incomprehensible to me. Would it also be expected to run twice as fast as the custom function, and if so, why?
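On the optimization question above, one common precaution is to make the result of every iteration observable, so the optimizer cannot treat the whole loop as dead code. The following is only a sketch under that assumption: `timeLoop`, `g_sink` and `reverseStd` are hypothetical names, not part of the original program, and a volatile sink does not tell you what the compiler actually generated; only the disassembly does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ctime>

typedef unsigned char byte;

// Hypothetical sink: a write to a volatile object is an observable side
// effect, so the compiler may not discard the value stored into it.
volatile byte g_sink = 0;

// Calls fn(buf, n) `iterations` times and returns the elapsed time in
// seconds as reported by clock().
template <typename Fn>
double timeLoop(Fn fn, byte* buf, std::size_t n, long iterations)
{
    assert(buf != nullptr && n > 0);

    const std::clock_t begin = std::clock();
    for (long i = 0; i < iterations; ++i) {
        fn(buf, n);
        g_sink = buf[0];  // publish one byte of the result per iteration
    }
    const std::clock_t end = std::clock();

    return static_cast<double>(end - begin) / CLOCKS_PER_SEC;
}

// Thin wrapper so std::reverse has the same signature as swapBytes.
inline void reverseStd(byte* in, std::size_t n)
{
    std::reverse(in, in + n);
}
```

Calling `timeLoop(reverseStd, arr, 8, iterations)` and then `timeLoop(swapBytes, arr, 8, iterations)` would time both candidates through the same harness, which also keeps the loop overhead identical for the two.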
Contemplations:
There seem to be two high-precision timers on offer: clock() and QueryPerformanceCounter (on Windows). Obviously we would like to measure the CPU time of our code and not the real time, but as far as I understand, these functions don't provide that, so other processes on the system would interfere with the measurements. This page in the GNU C Library documentation seems to contradict that, but when I set a breakpoint in VC++, the debugged process gets a lot of clock ticks even though it was suspended (I have not tested this under GNU). Am I missing alternative counters for this, or do we need at least special libraries or classes for it? If not, is clock() good enough in this example, or would there be a reason to use QueryPerformanceCounter?
What can we know for certain without debugging, disassembling and profiling tools? Is anything actually happening? Is the function call being inlined or not? When checking in the debugger, the bytes do actually get swapped, but I would rather know why from theory than from testing.
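To see the difference between real time and what clock() reports, both can be sampled around the same piece of work. This is a minimal sketch assuming a C++11 compiler; `Timing` and `measure` are hypothetical names. The C standard describes clock() as processor time, but the Microsoft C runtime documents it as elapsed wall-clock time, which would be consistent with the ticks accumulating while the process sat at a breakpoint.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Result of one measurement, in seconds.
struct Timing {
    double wallSeconds;   // elapsed real time from std::chrono::steady_clock
    double clockSeconds;  // whatever std::clock() reports on this platform
};

// Runs work() once and samples both clocks around it.
template <typename Work>
Timing measure(Work work)
{
    const auto wallBegin = std::chrono::steady_clock::now();
    const std::clock_t clockBegin = std::clock();

    work();

    const std::clock_t clockEnd = std::clock();
    const auto wallEnd = std::chrono::steady_clock::now();

    Timing t;
    t.wallSeconds = std::chrono::duration<double>(wallEnd - wallBegin).count();
    t.clockSeconds = static_cast<double>(clockEnd - clockBegin) / CLOCKS_PER_SEC;

    assert(t.wallSeconds >= 0.0);  // steady_clock never runs backwards
    return t;
}
```

If the two numbers stay close even when the work includes a sleep or a pause in the debugger, clock() is behaving as a wall clock on that platform and a process-level CPU time query is needed instead.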
Thanks for any guidance.
Update
Thanks to a tip from tojas, the swapBytes function now runs as fast as std::reverse. I failed to realize that the temporary copy, in the case of a byte, only needs to be a register and is therefore very fast. Elegance can blind you.

    inline void swapBytes( byte* in, size_t n )
    {
        byte t;
        for( int i=0; i<7-i; ++i )
        {
            t = in[i];
            in[i] = in[7-i];
            in[7-i] = t;
        }
    }

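Since the version above hardcodes the index 7 and ignores n, it is worth confirming that a candidate really does the same thing as std::reverse before comparing timings. The sketch below is one way to do that; `swapBytesN` and `agreesWithStdReverse` are hypothetical names, and the 64-byte limit is an arbitrary assumption to keep the buffers on the stack.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

typedef unsigned char byte;

// Variant of the temporary-based swap that honours n instead of the fixed 7.
inline void swapBytesN(byte* in, std::size_t n)
{
    if (n < 2) return;  // nothing to do; also avoids n-1 wrapping for n == 0
    for (std::size_t lo = 0, hi = n - 1; lo < hi; ++lo, --hi) {
        const byte t = in[lo];
        in[lo] = in[hi];
        in[hi] = t;
    }
}

// Returns true if swapBytesN and std::reverse produce the same bytes
// for the first n bytes of src.
inline bool agreesWithStdReverse(const byte* src, std::size_t n)
{
    assert(n <= 64);
    byte a[64];
    byte b[64];
    std::copy(src, src + n, a);
    std::copy(src, src + n, b);

    swapBytesN(a, n);
    std::reverse(b, b + n);

    return std::equal(a, a + n, b);
}
```

Checking a few lengths, including odd ones and the empty case, catches the usual off-by-one mistakes in the loop bounds before any timing is done.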
Thanks to a tip from ChrisW, I found that on Windows you can get the actual CPU time consumed by a (read: your) process through Windows Management Instrumentation. That definitely looks more interesting than the high-precision counter.