How should I interpret OProfile output?

I recently tried profiling my application with OProfile. The data it collects is already very valuable to me, but I am having trouble interpreting it accurately. After configuring oprofile and running my application under it, I generated a report and got:

root@se7xeon:src# opreport image:test -l -t 1
Overflow stats not available
CPU: P4 / Xeon with 2 hyper-threads, speed 3191.66 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 750000
samples  %        symbol name
215522   84.9954  cci::Image::interpolate(unsigned char*, cci::Matrix<cci::Point2DF> const&) const
17998     7.0979  cci::Calc::diff(unsigned char const*, unsigned char const*)
13171     5.1942  cci::Image::getIRect(unsigned char*, int, int) const
5519      2.1765  cci::Image::getFRect(unsigned char*, double, double) const
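
For reference, the opcontrol setup behind this amounts to something like the following (reconstructed from the header above, so the exact flags may have differed slightly):

opcontrol --no-vmlinux
opcontrol --event=GLOBAL_POWER_EVENTS:750000:0x01:1:1   # event:count:unitmask:kernel:user
opcontrol --start
./test                                                  # the profiled application
opcontrol --dump
opcontrol --shutdown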

OK, so my interpolation function is responsible for about 84% of the application's runtime (far too long). It seemed like a good idea to look into it:

root@se7xeon:src# opannotate image:test --source
[...]

/* cci::Image::interpolate(unsigned char*, cci::Matrix<cci::Point2DF> const&) const  total: 215522 84.9954 */
  1392  0.5529 :void Image::interpolate(CCIPixel *output, const Matrix<Point2DF> &inputPoints) const throw()
     4  0.0016 :{
[...]
               :    col0 = static_cast<int>(point[idx].x);
     3  0.0012 :    col1 = col0+1;
   629  0.2498 :    row0 = static_cast<int>(point[idx].y);
   385  0.1529 :    row1 = row0+1;
 56214 22.3266 :    if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)
               :    {
               :        col0 = row0 = col1 = row1 = 0;
               :    }

If I understand correctly, the if condition is responsible for over 22% of the program's execution time. The opening brace and the function declaration also seem to take time; should I put that down to function call overhead (pushing parameters onto the stack, the jump, popping them off)?

I changed a few things in the code (related to a later bottleneck, since I had no idea how to optimize that if), recompiled, and ran it through oprofile again (not forgetting opcontrol --reset). Now the annotated code looks something like this:

     6  0.0024 :    curPx = point[idx].x;
   628  0.2477 :    curPy = point[idx].y;
   410  0.1617 :    col0 = static_cast<int>(curPx);
 57910 22.8380 :    col1 = col0+1;
               :    row0 = static_cast<int>(curPy);
               :    row1 = row0+1;
               :    if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)
               :    {
               :        col0 = row0 = col1 = row1 = 0;
               :    }

This time the if takes no time (?) at all, the most expensive instruction is "col1 = col0 + 1", and the whole timing seems to have shifted upwards. How can that be? Can this output be trusted to identify bottlenecks in the source?

Another thing I am unsure about: when setting up opcontrol, I specified GLOBAL_POWER_EVENTS as the tracked event, with a sample count of 750,000. In the output, the interpolation function supposedly takes 84%, yet the number of samples recorded inside it is only a bit over 200,000. That is not even half of the requested count. Am I to understand that the remaining ~500,000 samples went to applications not listed in the output (the kernel, Xorg, etc.)?
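
To put numbers on that, here is the arithmetic behind my ~500,000 figure, using the samples opreport attributed to my binary:

215522 + 17998 + 13171 + 5519 = 252210 samples in test
750000 - 252210 = 497790 samples apparently elsewhere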

2 answers

When profiling optimized code, you really cannot rely on samples mapping onto exact source lines. The compiler moves things around too much.

For an accurate picture, you will need to look at the annotated disassembly.
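
With OProfile that means asking opannotate for annotated assembly instead of annotated source, e.g. (reusing image:test from the question; --assembly is a standard opannotate option):

opannotate image:test --assembly > test_annotated.s

There the samples are attached to individual machine instructions, so the compiler's reordering and inlining can no longer mislead you about which source line is hot.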


OProfile can (I'm told) take stack samples on wall-clock time (not just CPU time), and it can give you line-level percentages. What you are looking for is lines that appear on a large fraction of the stack samples.

I would not turn on compiler optimization until I had finished tuning the code by hand, because optimization just obscures everything.
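
Concretely, that means building roughly like this while investigating (a g++ sketch; the file names are placeholders):

g++ -O0 -g -o test main.cpp image.cpp    # no optimizer, full debug info, while hand-tuning
g++ -O2 -g -o test main.cpp image.cpp    # only after the hand-tuning is done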

When you say the interpolation routine takes 84% of the time, that begs a question. The whole program takes some total time, right? It takes 100% of that time. If you cut the program's time in half, or if you double it, it still takes 100% of the time. Whether 84% for interpolation is too much or not depends on whether the interpolation is being done more than necessary.

So I would suggest not asking whether a routine's percentage is too large. Rather, look for lines of code that take a significant fraction of time and ask whether they can be optimized. See the difference? After you optimize the code, the overall execution time can drop considerably, but the same lines can still be a large percentage of a smaller total. Code is not optimal when nothing takes a large percentage; it is optimal when everything that does take a large percentage cannot be improved.
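
A worked example of that difference:

total = 100 s, interpolation = 84 s   -> interpolation is 84%
make interpolation twice as fast:
total =  58 s, interpolation = 42 s   -> interpolation is still ~72%

The percentage barely moved, yet the program got nearly twice as fast. The percentage alone tells you neither of those things.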

I have no use for tools that just give me numbers. I want insight. For example, if that routine really accounts for 84% of the time, then out of 10 stack samples it would appear on 8.4 of them, give or take. The exact number does not matter. What matters is understanding why it is there. Did it really need to be there that much? Looking at the stack samples can tell you. Maybe you are doing the interpolation twice as often as necessary? Often people discover, by analyzing the why, that the routine they were trying to speed up did not need to be called nearly so much, maybe not at all. I cannot guess in your case; only the insight gained from studying the program's state can tell you.
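
If your OProfile setup cannot take stack samples, a low-tech substitute is to interrupt the running process under a debugger a handful of times (a sketch, assuming Linux and gdb; $PID stands for your process id):

for i in $(seq 10); do
    gdb -batch -ex "bt" -p $PID    # attach, print one stack sample, detach
    sleep 1
done

Any line that shows up on a large share of those backtraces is costing you roughly that share of the time.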

