I recently tried profiling my application with OProfile. The data it collects is already very valuable to me, but I'm having difficulty interpreting it accurately.
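For reference, my setup was roughly the following (reconstructed from memory, so treat the exact daemon options as approximate; the event specification format is name:count:unitmask:kernel:user):

    opcontrol --no-vmlinux
    opcontrol --event=GLOBAL_POWER_EVENTS:750000:0x01:1:1
    opcontrol --start
    ./test                 # the application under test
    opcontrol --dump       # flush the collected samples
    opcontrol --shutdown

After the run, I generated a report and got: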
    root@se7xeon:src# opreport image:test -l -t 1
    Overflow stats not available
    CPU: P4 / Xeon with 2 hyper-threads, speed 3191.66 MHz (estimated)
    Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 750000
    samples  %        symbol name
    215522   84.9954  cci::Image::interpolate(unsigned char*, cci::Matrix const&) const
    17998     7.0979  cci::Calc::diff(unsigned char const*, unsigned char const*)
    13171     5.1942  cci::Image::getIRect(unsigned char*, int, int) const
    5519      2.1765  cci::Image::getFRect(unsigned char*, double, double) const
OK, so my interpolation function is responsible for 84% of the application's runtime, which is far too much. It seems like a good idea to study it, then:
    root@se7xeon:src# opannotate image:test --source
    [...]
     1392   0.5529 :void Image::interpolate(CCIPixel *output, const Matrix<Point2DF> &inputPoints) const throw()
        4   0.0016 :{
    [...]
                   :    col0 = static_cast<int>(point[idx].x);
        3   0.0012 :    col1 = col0+1;
      629   0.2498 :    row0 = static_cast<int>(point[idx].y);
      385   0.1529 :    row1 = row0+1;
    56214  22.3266 :    if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)
                   :    {
                   :        col0 = row0 = col1 = row1 = 0;
                   :    }
If I understand correctly, the if condition is responsible for more than 22% of the program's execution time. The opening brace and the function declaration also seem to take time; should that correspond to the overhead of the function call (the "push parameters onto the stack, jump, pop parameters" sequence)?
I changed some things in the source (related to a later bottleneck, since I had no idea how to optimize that if), recompiled, and ran everything through oprofile again.
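The re-run sequence I use is roughly this (again from memory, so the exact invocations are approximate):

    opcontrol --reset      # discard the previous session's samples
    opcontrol --start
    ./test
    opcontrol --dump
    opannotate image:test --source

The annotated code now looks something like this: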
        6   0.0024 :    curPx = point[idx].x;
      628   0.2477 :    curPy = point[idx].y;
      410   0.1617 :    col0 = static_cast<int>(curPx);
    57910  22.8380 :    col1 = col0+1;
                   :    row0 = static_cast<int>(curPy);
                   :    row1 = row0+1;
                   :    if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)
                   :    {
                   :        col0 = row0 = col1 = row1 = 0;
                   :    }
This time the if takes no time (?) at all, the most expensive instruction is "col1 = col0+1", and the whole cost seems to have shifted upwards. How can that be? Can source annotation be trusted to identify bottlenecks at all?
Another point of doubt for me: when setting up opcontrol, I specified the traced event as GLOBAL_POWER_EVENTS with a count of 750k. In the output, the interpolate function supposedly takes 84%, but the number of samples recorded inside it is just over 200 thousand; even summing all four listed symbols gives only about 252 thousand samples. That is not even 50% of the requested number. Am I to understand that the remaining ~500 thousand samples were taken by applications not shown in the output (the kernel, Xorg, etc.)?
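I suppose the distribution across all binaries could be checked with a plain report, without the image filter, something like:

    root@se7xeon:src# opreport

but even so, I'd like to be sure I'm reading the numbers correctly.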