What is the best way to benchmark C++ code speed without a profiler, and does it make sense to try at all?

On SO there are quite a few questions about performance profiling, but I can't seem to find the whole picture. There are many issues involved, and most questions and answers ignore all but a few of them, or don't justify their proposals.

Here is what interests me: if I have two functions that do the same thing and I'm interested in the difference in speed, does it make sense to test this without external tools, with timers compiled into the test itself, or will this compiled-in testing influence the results too significantly?

I ask because, if it is reasonable, then as a C++ programmer I want to know how best to do it, since timers are much simpler than using external tools. If it makes sense, let's walk through all the possible traps:

Consider this example. The following code shows two ways to do the same thing:

    #include <algorithm>
    #include <ctime>
    #include <iostream>

    typedef unsigned char byte;

    // XOR-swap the bytes in place, without a temporary
    // (the parameter must be byte*, not void*, or the indexing won't compile)
    inline void swapBytes( byte* in, size_t n )
    {
        for( size_t lo=0, hi=n-1; hi>lo; ++lo, --hi )
            in[lo] ^= in[hi],
            in[hi] ^= in[lo],
            in[lo] ^= in[hi];
    }

    int main()
    {
        byte arr[9] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
        const int iterations = 100000000;

        clock_t begin = clock();
        for( int i=iterations; i!=0; --i )
            swapBytes( arr, 8 );
        clock_t middle = clock();
        for( int i=iterations; i!=0; --i )
            std::reverse( arr, arr+8 );
        clock_t end = clock();

        double secSwap = (double) ( middle-begin ) / CLOCKS_PER_SEC;
        double secReve = (double) ( end-middle )   / CLOCKS_PER_SEC;

        std::cout << "swapBytes, for: " << iterations << " times takes: "
                  << middle-begin << " clock ticks, which is: " << secSwap << "sec." << std::endl;
        std::cout << "std::reverse, for: " << iterations << " times takes: "
                  << end-middle << " clock ticks, which is: " << secReve << "sec." << std::endl;
        std::cin.get();
        return 0;
    }

    // Output:
    // Release:
    // swapBytes, for: 100000000 times takes: 3000 clock ticks, which is: 3sec.
    // std::reverse, for: 100000000 times takes: 1437 clock ticks, which is: 1.437sec.
    // Debug:
    // swapBytes, for: 10000000 times takes: 1781 clock ticks, which is: 1.781sec.
    // std::reverse, for: 10000000 times takes: 12781 clock ticks, which is: 12.781sec.

Problems:

  • Which timers should be used, and how does one get the processor time actually consumed by the code in question?
  • What are the implications of compiler optimization (since these functions just swap bytes back and forth, the most efficient thing is obviously to do nothing at all)?
  • Given the results presented here, do you think they are accurate (I can assure you that multiple runs give very similar results)? If so, can you explain how std::reverse gets to be so fast, given the simplicity of the user-defined function? I do not have the source code from the VC++ version that I used for this test, but here is the implementation from GNU. It comes down to the function iter_swap, which is completely opaque to me. Should that be expected to run about twice as fast as the custom function, and if so, why? (A sketch of what such an implementation boils down to follows this list.)
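This is not the actual libstdc++ source, just a minimal sketch of what std::reverse typically boils down to for the array case (the names my_iter_swap/my_reverse are mine): iter_swap copies through a temporary, which for a single byte lives in a register:

    #include <iterator>

    // Simplified sketch (not the real libstdc++ code) of std::reverse:
    // pairwise swaps through a temporary copy.
    template< typename Iter >
    inline void my_iter_swap( Iter a, Iter b )
    {
        typename std::iterator_traits<Iter>::value_type tmp = *a; // temp copy;
        *a = *b;                                                  // for a byte this
        *b = tmp;                                                 // stays in a register
    }

    template< typename Iter >
    void my_reverse( Iter first, Iter last )
    {
        while( first < last )                 // real implementations dispatch on
            my_iter_swap( first++, --last );  // iterator category; this covers arrays
    }

Calling my_reverse( arr, arr+8 ) performs three independent moves per pair, whereas the XOR trick's three operations each depend on the previous one's result, serializing the pipeline — one plausible reason for a roughly 2× difference.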

Contemplations:

  • It seems that two high-precision timers are on offer: clock() and QueryPerformanceCounter (on Windows). Obviously we would like to measure the processor time consumed by our code, not wall-clock time, but as far as I understand these functions do not provide that, so other processes on the system will interfere with the measurements. This page from the GNU C library documentation seems to contradict that, but when I set a breakpoint in VC++, the debugged process accumulates a lot of clock ticks even though it is paused (I did not test under GNU). Are there alternative counters for this, or do we need special libraries or classes? If not, is clock() sufficient in this example, or would there be a reason to use QueryPerformanceCounter? (See the sketch after this list for one Windows API route to per-process CPU time.)

  • What can we know for sure without debugging, disassembling and profiling? Is anything actually happening at all? Is the function call actually performed, or is it optimized away? When checking in the debugger, the bytes really do change places, but I would rather know from theory why, than establish it by testing.
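Not from the question, but one Windows API route that measures CPU time rather than wall time (a minimal sketch, error handling omitted): GetProcessTimes reports the kernel and user time actually charged to the process, so time spent paused at a breakpoint is not counted. Note that it advances only at scheduler-tick granularity (roughly 10–15 ms), so it suits long runs only:

    #include <windows.h>
    #include <iostream>

    // CPU time (user + kernel) charged to this process so far, in seconds.
    // FILETIME values are expressed in 100-nanosecond units.
    double processCpuSeconds()
    {
        FILETIME creationTime, exitTime, kernelTime, userTime;
        GetProcessTimes( GetCurrentProcess(),
                         &creationTime, &exitTime, &kernelTime, &userTime );
        ULARGE_INTEGER k, u;
        k.LowPart  = kernelTime.dwLowDateTime;
        k.HighPart = kernelTime.dwHighDateTime;
        u.LowPart  = userTime.dwLowDateTime;
        u.HighPart = userTime.dwHighDateTime;
        return ( k.QuadPart + u.QuadPart ) / 1e7;
    }

    int main()
    {
        double before = processCpuSeconds();
        // ... code under test ...
        double after = processCpuSeconds();
        std::cout << "CPU time consumed: " << after - before << " sec\n";
        return 0;
    }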

Thanks for any guidance.

Update

Thanks to a tip from tojas, the swapBytes function now runs as fast as std::reverse. I had not realized that in the case of a byte, the temporary copy only needs a register, and is therefore very fast. Elegance can dazzle you.

    // note: ignores n and hard-codes an 8-byte buffer
    inline void swapBytes( byte* in, size_t n )
    {
        byte t;
        for( int i=0; i<7-i; ++i )
        {
            t       = in[i];
            in[i]   = in[7-i];
            in[7-i] = t;
        }
    }

Thanks to a tip from ChrisW, I found that on Windows you can get the actual CPU time consumed by a process (read: your code) via Windows Management Instrumentation. It definitely looks more interesting than a high-precision counter.

+8
c++ performance profiling timer
Jun 27 '10 at 17:01
8 answers

Obviously we would like to measure the processor time consumed by our code, not wall-clock time, but as far as I understand these functions do not provide that, so other processes on the system will interfere with the measurements.

I do two things to ensure that wall-clock time and CPU time are approximately the same:

  • Time over a significant period, i.e. several seconds (e.g., by running the code under test in a loop of several thousand iterations)

  • Time when the machine is more or less idle, apart from whatever I'm testing.

Alternatively, if you want to measure only the CPU time, or measure it more accurately, per-thread CPU time is available as a performance counter (see, for example, perfmon.exe); a way to read it from code is sketched below.
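To read per-thread CPU time from code rather than from perfmon (my sketch, not part of the answer), the Win32 call GetThreadTimes does this; its FILETIME values are in 100-nanosecond units, though they advance only at scheduler-tick granularity:

    #include <windows.h>

    // CPU time (user + kernel) consumed so far by the calling thread, in seconds.
    double threadCpuSeconds()
    {
        FILETIME creationTime, exitTime, kernelTime, userTime;
        GetThreadTimes( GetCurrentThread(),
                        &creationTime, &exitTime, &kernelTime, &userTime );
        ULARGE_INTEGER k, u;
        k.LowPart  = kernelTime.dwLowDateTime;
        k.HighPart = kernelTime.dwHighDateTime;
        u.LowPart  = userTime.dwLowDateTime;
        u.HighPart = userTime.dwHighDateTime;
        return ( k.QuadPart + u.QuadPart ) / 1e7; // 100 ns units -> seconds
    }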

What can we know for sure without debugging, disassembling and profiling?

Almost nothing (except that I/O tends to be relatively slow).

+4
Jun 27 '10 at 17:37

To answer the main question: the "reverse" algorithm simply swaps the elements of the array (through a temporary copy), and does not do arithmetic on the elements of the array.

+2
Jun 27 '10 at 17:49

May I say that you are really asking two questions?

  • Which one is faster, and by how much?

  • And why is it faster?

For the first, you do not need high-precision timers. All you need to do is run them "long enough" and measure with low-precision timers. (I'm old-fashioned: my wristwatch has a stopwatch function, and it is entirely adequate.) A sketch of such a calibration loop follows.
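For instance (my sketch, not the answerer's code), let the harness pick the iteration count: run each candidate until a few seconds have elapsed, at which point even a coarse timer gives you two or three significant digits:

    #include <ctime>
    #include <iostream>

    // Run f repeatedly for at least `seconds` of wall time, then report the
    // average cost per call. A coarse timer is fine because the run is long.
    template< typename Func >
    double secondsPerCall( Func f, double seconds = 3.0 )
    {
        long calls = 0;
        clock_t begin = clock();
        for( ;; )
        {
            for( int i = 0; i < 1000; ++i )  // batch the calls so the clock()
                f();                         // reads don't pollute the timing
            calls += 1000;
            double elapsed = (double)( clock() - begin ) / CLOCKS_PER_SEC;
            if( elapsed >= seconds )
                return elapsed / calls;
        }
    }

Usage would be, e.g., std::cout << secondsPerCall( someFunction ) << "\n"; for each candidate in turn.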

For the second, sure, you can run the code under the debugger and single-step it at the instruction level. Since the basic operations are so simple, you will easily be able to see roughly how many instructions the basic loop takes.

Just a thought: performance is not a difficult subject; people tend to make it harder than it is. Usually what you are really trying to do is find problems, and for that there is a simple approach .

+2
Jun 27 '10

Use QueryPerformanceCounter on Windows if you need high-resolution timing. The counter's precision depends on the CPU, but it can be as fine as a single clock tick. However, profiling under real-world operation is always the better idea.
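A minimal usage sketch (mine, not the answerer's): pair it with QueryPerformanceFrequency to convert ticks into seconds, bearing in mind that this measures wall-clock time, not CPU time:

    #include <windows.h>
    #include <iostream>

    int main()
    {
        LARGE_INTEGER freq, start, stop;
        QueryPerformanceFrequency( &freq );  // counter ticks per second
        QueryPerformanceCounter( &start );

        // ... code under test ...

        QueryPerformanceCounter( &stop );
        double sec = (double)( stop.QuadPart - start.QuadPart )
                   / (double)freq.QuadPart;
        std::cout << "elapsed: " << sec << " sec\n";
        return 0;
    }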

+2
Jun 27 '10

(This answer is specific to Windows XP and the 32-bit VC++ compiler.)

The simplest thing for timing small bits of code is the CPU time-stamp counter. This is a 64-bit count of elapsed processor cycles, which is about the finest resolution you are going to get. The actual numbers you get are not especially useful as they stand, but if you average several runs of various competing approaches, you can compare them that way. The results are a bit noisy, but perfectly valid for comparison purposes.

To read the time-stamp counter, use code like the following:

    LARGE_INTEGER tsc;
    __asm
    {
        cpuid
        rdtsc
        mov tsc.LowPart, eax
        mov tsc.HighPart, edx
    }

(The cpuid instruction is there to serialize the pipeline, i.e. to stop rdtsc executing before any in-flight instructions have completed.)

There are four things worth noting about this approach.

Firstly, because of the inline assembler, it will not work as-is with the MS x64 compiler, which has no inline assembler. (You will need to create an .ASM file with the function in it. Exercise for the reader; I do not know the details.)

Secondly, to avoid problems with cycle counters not being synchronized across whatever cores/threads/CPUs you have, you may find you need to set the process affinity so that it only runs on one specific execution unit, as sketched below. (Then again... you may not.)
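Setting the affinity might look like this (my sketch, not the answerer's code; SetThreadAffinityMask pins just the timing thread, while SetProcessAffinityMask pins the whole process):

    #include <windows.h>

    // Pin the calling thread to the first logical CPU so that rdtsc always
    // reads the same core's time-stamp counter. One mask bit per logical CPU.
    void pinToFirstCpu()
    {
        SetThreadAffinityMask( GetCurrentThread(), 1 );
    }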

Thirdly, you will definitely want to check the generated assembler to make sure the compiler is generating roughly the code you expect. Watch out for pieces of code being removed, functions being inlined, that sort of thing.

Finally, the results are rather noisy. The cycle counter counts cycles spent on everything, including waiting on caches, time spent running other processes, time spent in the OS itself, and so on. Unfortunately, there is no way (under Windows, at least) to time just your own process. So I suggest running the code under test many times (several tens of thousands) and averaging. This is not very scientific, but it seems to have given me useful results.

+2
Jun 27 '10 at 20:03

Do you have something against profilers? They help a ton. Since you are on WinXP, you should really give VTune a try. Try the call-graph sampling test and look at the self time and total time of the functions being called. There is no better way to tune your program so that it's as fast as possible without being an assembly genius (and a truly exceptional one at that).

Some people just seem to be allergic to profilers. I used to be one of those, thinking I knew better where my hot spots were. I was often right about obvious algorithmic inefficiencies, but almost always wrong about the more micro-optimization cases. Simply rewriting a function without changing any of the logic (e.g.: reordering things, moving exceptional-case code into a separate, non-inlined function, etc.) can make functions tens of times faster, and even the best disassembly experts usually cannot predict that without a profiler.

As for relying on simplistic timing tests alone: they are extremely problematic. This current test is not so bad, but it is a very common mistake to write timing tests in a way where the optimizer dead-code-eliminates the work, so that you end up timing the cost of practically a nop, or of nothing at all. You need some ability to interpret the disassembly to make sure the compiler has not done this; one defensive sketch follows.
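One common defensive pattern (my sketch, not from the answer) is to feed the benchmarked result into a volatile sink, so the optimizer cannot prove the loop has no observable effect and delete it. This reduces the risk but does not replace reading the disassembly:

    #include <algorithm>
    #include <iostream>

    typedef unsigned char byte;

    volatile byte sink; // writes to a volatile are observable side effects

    int main()
    {
        byte arr[8] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
        for( int i = 0; i < 100000000; ++i )
        {
            std::reverse( arr, arr + 8 );
            sink = arr[0]; // consume the result so the loop can't be eliminated
        }
        std::cout << "done\n";
        return 0;
    }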

Also, timing tests like this tend to bias the results significantly, since many of them just involve running your code over and over in the same loop, which tends to test only how your code behaves when all the memory is in cache and branch prediction is working perfectly. It often just shows you best-case scenarios, without showing you the average real-world case.

Timing tests against real-world runs are somewhat better; something closer to what your application will be doing at a high level. That will not give you specifics about what is taking how much time, but that is precisely what the profiler is meant to do.

+1
Jun 27 '10 at 17:24

I would suppose that anyone competent enough to answer all your questions is far too busy to answer all your questions. In practice it is probably more effective to ask specific, well-defined questions. That way you can hope to get well-defined answers which you can collect, and thereby be on a path towards wisdom.

So, anyway, maybe I can answer your question about which clock to use on Windows.

clock() is not a high-precision clock. If you look at the value of CLOCKS_PER_SEC, you will see it has a resolution of 1 millisecond. That is adequate only if you are timing very long routines, or a loop with tens of thousands of iterations. As you noticed, if you try repeating a simple method tens of thousands of times just to get a time measurable with clock(), the compiler is liable to step in and optimize the whole thing away.
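You can demonstrate the granularity directly (my sketch): spin until clock() changes value and print the size of the step. On Windows it typically advances in jumps of the scheduler tick (around 10–16 ms), even though CLOCKS_PER_SEC is 1000:

    #include <ctime>
    #include <iostream>

    int main()
    {
        std::cout << "CLOCKS_PER_SEC = " << CLOCKS_PER_SEC << "\n"; // 1000 on Windows

        clock_t t0 = clock(), t1;
        while( ( t1 = clock() ) == t0 )
            ;  // busy-wait until the clock visibly advances
        std::cout << "smallest observable step: " << ( t1 - t0 ) << " tick(s) = "
                  << (double)( t1 - t0 ) / CLOCKS_PER_SEC << " sec\n";
        return 0;
    }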

So, in practice, the only clock to use is QueryPerformanceCounter().

+1
Jun 27 '10 at 17:24

What? How would you measure speed without a profiler? The very act of measuring speed is profiling! The question amounts to "how can I write my own profiler?", and the answer is, clearly, that you should not.

Besides, you should be using std::swap in the first place, which renders this whole pointless chase moot; a sketch follows.
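For what that looks like (my sketch, not the answerer's code): swapBytes written in terms of std::swap, which for byte operands compiles down to the same register-based temporary-copy swap as the fixed version in the question's update:

    #include <algorithm> // std::swap lives here in C++98/03
    #include <cstddef>   // size_t

    typedef unsigned char byte;

    inline void swapBytes( byte* in, size_t n )
    {
        for( size_t lo = 0, hi = n - 1; hi > lo; ++lo, --hi )
            std::swap( in[lo], in[hi] ); // temp-copy swap, as std::reverse does
    }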

-1 for pointlessness.

-3
Jun 27 '10 at 7:18


