Profiling help needed

I have a problem with profiling - imagine that I have the following code ...

    void main()
    {
        well_written_function();
        badly_written_function();
    }

    void well_written_function()
    {
        for (a small number)
        {
            highly_optimised_subroutine();
        }
    }

    void badly_written_function()
    {
        for (a wastefully and unnecessarily large number)
        {
            highly_optimised_subroutine();
        }
    }

    void highly_optimised_subroutine()
    {
        // lots of code
    }

If I run it under VTune (or other profilers), it is very hard to spot that anything is wrong. All the hotspots appear in the section marked "// lots of code", which is already optimised. The badly_written_function() will not be highlighted in any way, even though it is the cause of all the trouble.

Is there any vtune feature that will help me find the problem?

Is there some mode in which I can find the time spent by badly_written_function() and all the functions it calls?

+7
c++ c visual-studio-2008 vtune
4 answers

This is usually called a "call graph profile", and I'm pretty sure Visual Studio will do it.

+1

Rolling your own reasonably robust profiler is not that hard. Insert into main():

    int main()
    {
        profileCpuUsage(1);      // start timer #1
        well_written_function();
        profileCpuUsage(2);      // stop timer #1, and start timer #2
        badly_written_function();
        profileCpuUsage(-1);     // print stats for timers #1 and #2
        return 0;
    }

Where:

    #define NUMBER(a) ((int)(sizeof(a) / sizeof(a)[0]))

    void profileCpuUsage(int slice)
    {
        static struct {
            int iterations;
            double elapsedTime;
        } slices[30];                        // 0 is a don't care slice

        if (slice < 0) {                     // -1 = print
            if (slices[0].iterations)
                for (slice = 1; slice < NUMBER(slices); slice++)
                    printf("Slice %2d Iterations %7d Seconds %7.3f\n",
                           slice, slices[slice].iterations, slices[slice].elapsedTime);
        }
        else {
            static int i;                    // = previous slice
            static double t;                 // = previous t1
            const double t1 = realElapsedTime(); // see below for definition

            assert(slice < NUMBER(slices));
            slices[i].iterations += 1;
            slices[i].elapsedTime += t1 - t; // i = 0 first time through
            i = slice;
            t = t1;
        }
    }

Now, admittedly, in your simple example there is little point in using profileCpuUsage(). And it has the disadvantage of requiring you to manually instrument your code by calling profileCpuUsage() in appropriate places.

But the benefits include:

  • You can time any piece of code, not just procedures.
  • It is quick to add and remove, as you binary-search for and/or remove code hotspots.
  • It focuses only on the code you are interested in.
  • Portable!
  • KISS

The one tricky non-portable part is defining the realElapsedTime() function so that it provides enough granularity to get valid times. The following usually works for me (using the Windows API under CYGWIN):

    #include <windows.h>

    double realElapsedTime(void)  // granularity about 50 microsec on test machines
    {
        static LARGE_INTEGER freq, start;
        LARGE_INTEGER count;

        if (!QueryPerformanceCounter(&count))
            assert(0 && "QueryPerformanceCounter");
        if (!freq.QuadPart) {     // one time initialization
            if (!QueryPerformanceFrequency(&freq))
                assert(0 && "QueryPerformanceFrequency");
            start = count;
        }
        return (double)(count.QuadPart - start.QuadPart) / freq.QuadPart;
    }

For plain Unix, the following generally works:

    #include <sys/time.h>

    double realElapsedTime(void)  // returns 0 first time called
    {
        static struct timeval t0;
        struct timeval tv;

        gettimeofday(&tv, 0);
        if (!t0.tv_sec)
            t0 = tv;
        return tv.tv_sec - t0.tv_sec + (tv.tv_usec - t0.tv_usec) / 1000000.;
    }

realElapsedTime() gives wall-clock time, not CPU time, which is usually what I want.

There are also other, less portable methods to achieve finer granularity using RDTSC; see, for example, http://en.wikipedia.org/wiki/Time_Stamp_Counter and its links, but I have not tried them.

Edit: ravenspoint's is a very nice answer, and it seems not too different from mine. His answer uses nice descriptive strings rather than just ugly numbers, which has often frustrated me. But this can be fixed with about a dozen extra lines (though it almost doubles the line count!).

Note that we want to avoid using malloc(), and I'm even a bit doubtful about strcmp(). So the number of slices never grows. And hash collisions are simply flagged rather than resolved: the human profiler can fix this by bumping the number of slices up from 30, or by changing the description. Untested:

    static unsigned gethash(const char *str) // "djb2", for example
    {
        unsigned c, hash = 5381;
        while ((c = *str++))
            hash = ((hash << 5) + hash) + c; // hash * 33 + c
        return hash;
    }

    void profileCpuUsage(const char *description)
    {
        static struct {
            int iterations;
            double elapsedTime;
            char description[20];            // added!
        } slices[30];

        if (!description) {
            // print stats, but using description, mostly unchanged...
        }
        else {
            const int slice = gethash(description) % NUMBER(slices);

            if (!slices[slice].description[0]) {  // if new slice
                assert(strlen(description) < sizeof slices[slice].description);
                strcpy(slices[slice].description, description);
            }
            else if (strcmp(slices[slice].description, description)) {
                strcpy(slices[slice].description, "!!hash conflict!!");
            }
            // remainder unchanged...
        }
    }

One other point: usually you will want to disable this profiling in release builds; this applies to ravenspoint's answer too. This can be done with the trick of using an evil macro to define it away:

 #define profileCpuUsage(foo) // = nothing 

If you do this, you will of course need to add parentheses to the function definition itself, to stop the macro from expanding there:

 void (profileCpuUsage)(const char *description)... 
+1

May I offer my own open source profiler, raven::set::cRunWatch? It is designed for exactly this problem, and it works on Windows with Visual Studio 2008 Standard Edition, so you do not need to pay for the edition that includes the built-in profiler.

I took your code, modified it slightly so that it compiles without forward declarations, and added the necessary calls to cRunWatch:

    // RunWatchDemo.cpp : Defines the entry point for the console application.
    //
    #include "stdafx.h"

    void highly_optimised_subroutine()
    {
        raven::set::cRunWatch runwatch("highly_optimised_subroutine");
        Sleep( 2 );
    }

    void badly_written_function()
    {
        raven::set::cRunWatch runwatch("badly_written_function");
        for (int k = 1; k < 1000; k++ )
        {
            highly_optimised_subroutine();
        }
    }

    void well_written_function()
    {
        raven::set::cRunWatch runwatch("well_written_function");
        for (int k = 1; k < 10; k++ )
        {
            highly_optimised_subroutine();
        }
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
        raven::set::cRunWatch::Start();

        well_written_function();
        badly_written_function();

        raven::set::cRunWatch::Report();
        return 0;
    }

Running this gives the output:

    raven::set::cRunWatch code timing profile
    Scope                           Calls    Mean (secs)     Total
    highly_optimised_subroutine      1008       0.002921  2.944146
    badly_written_function              1       2.926662  2.926662
    well_written_function               1       0.026239  0.026239

This shows badly_written_function to be a very close second-greatest time user, and therefore the culprit.

You can get cRunWatch from here. You will find sample code in the User Guide :-)

+1

Usually what you want to look at is a function's total (inclusive) time rather than its self time, so that the figure includes the time spent in the functions it calls.

In VTune, I would recommend using the Top-down tab. Or better yet, if you have the latest update, try the new Caller-Callee view. You can get details on it here: http://software.intel.com/en-us/forums/topic/376210 . It gives a flat list of functions with their total time, so you can see which are the most time-consuming subtrees in your program.

0
