Interpreting the differences in prstat vs. 'prstat -m' on Solaris

I have been using prstat and prstat -m a lot recently to investigate performance issues, and I think I basically understand the difference between sampling and microstate accounting in Solaris 10. So I do not expect both to always show exactly the same numbers.

Today I ran into a situation where the two showed very different results, and I have trouble interpreting and making sense of the output. The machine is a heavily loaded 8-CPU Solaris 10 box with several large WebSphere processes and an Oracle database. The system practically came to a halt today for about 15 minutes (load average > 700). I had difficulty getting any prstat information at all, but I was able to capture some output from "prstat 1 1" and "prstat -m 1 1", issued shortly after one another.

Top lines of outputs:

prstat 1 1:

     PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
    8379 was      3208M 2773M cpu5    60    0   5:29:13  19% java/145
    7123 was      3159M 2756M run     59    0   5:26:45 7.7% java/109
    5855 app1     1132M   26M cpu2    60    0   0:01:01 7.7% java/18
   16503 monitor   494M  286M run     59   19   1:01:08 7.1% java/106
    7112 oracle     15G   15G run     59    0   0:00:10 4.5% oracle/1
    7124 oracle     15G   15G cpu3    60    0   0:00:10 4.5% oracle/1
    7087 app1       15G   15G run     58    0   0:00:09 4.0% oracle/1
    7155 oracle     96M 6336K cpu1    60    0   0:00:07 3.6% oracle/1
  ...
 Total: 495 processes, 4581 lwps, load averages: 74.79, 35.35, 23.8

prstat -m 1 1 (a few seconds later):

     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
    7087 app1     0.1  56 0.0 0.2 0.4 0.0  13  30  96   2  33   0 oracle/1
    7153 oracle   0.1  53 0.0 3.2 1.1 0.0 1.0  42  82   0  14   0 oracle/1
    7124 oracle   0.1  47 0.0 0.2 0.2 0.0 0.0  52  77   2  16   0 oracle/1
    7112 oracle   0.1  47 0.0 0.4 0.1 0.0 0.0  52  79   1  16   0 oracle/1
    7259 oracle   0.1  45 9.4 0.0 0.3 0.0 0.1  45  71   2  32   0 oracle/1
    7155 oracle   0.0  42  11 0.0 0.5 0.0 0.1  46  90   1   9   0 oracle/1
    7261 oracle   0.0  37 9.5 0.0 0.3 0.0 0.0  53  61   1  17   0 oracle/1
    7284 oracle   0.0  32 5.9 0.0 0.2 0.0 0.1  62  53   1  21   0 oracle/1
  ...
 Total: 497 processes, 4576 lwps, load averages: 88.86, 39.93, 25.51
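
(For readers less familiar with the -m output: the eight microstate columns, USR through LAT, account for where each LWP spent its time during the sampling interval, so per thread they should add up to roughly 100%. A quick sanity check against the first row:

  USR + SYS + TRP + TFL + DFL + LCK + SLP + LAT
  0.1 +  56 + 0.0 + 0.2 + 0.4 + 0.0 +  13 +  30  =  99.7   (~100% of PID 7087's time)
)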

I have a hard time interpreting the two outputs together. prstat seems to tell me that a fair amount of Java processing is going on, together with some Oracle activity, just as I would expect in a normal situation. prstat -m, on the other hand, shows the machine as completely dominated by Oracle, consuming huge amounts of system time, with the CPUs heavily overloaded (large numbers in LAT).

I am inclined to believe the output of prstat -m, because it comes much closer to what the system felt like during this period: completely sluggish, almost no user requests getting through from WebSphere, etc. But why does prstat show such different numbers?

Any explanation for this is welcome!

CU, Joe

1 answer

There is a known problem with how prstat -m calculates CPU usage on Solaris: the value you see has been averaged over all threads (LWPs) in the process, and is therefore far too low for heavily multithreaded processes. For example, your Java application servers can have hundreds of threads each (see your NLWP), and probably fewer than a dozen of them are actually burning CPU, so the CPU usage of java looks "low". You would need to run prstat -Lm to get a per-LWP (per-thread) breakdown to see this effect. Reference:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6780169
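
To illustrate the averaging effect with hypothetical numbers (not measured, just the arithmetic): if 10 of the 145 LWPs of PID 8379 each spent ~100% of their time in USR and the other 135 were idle, the process-level prstat -m line would show only

  USR = (10 * 100 + 135 * 0) / 145  =  ~6.9%

even though the process keeps more than one full CPU busy. The per-thread view makes this visible (PID taken from the output above):

  # one output line per LWP instead of a per-process average
  prstat -Lm -p 8379 1 1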

Without further performance monitoring data it is hard to give you non-speculative explanations for what you saw there. My guess is lock contention inside java. One particular workload that can cause this is heavily multithreaded memory-mapped I/O, which will all pile up on the process address space lock. But it could also be purely userside java locks. Running plockstat on one of the java processes, and/or some simple dtrace profiling, would be helpful.
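
A minimal sketch of both suggestions; the PID and the duration/frequency parameters are arbitrary illustrations, not values from the question:

  # user-level lock contention in one java process for 30 seconds
  # (-A traces both contention and hold events, -e exits after N seconds)
  plockstat -A -e 30 -p 8379

  # simple dtrace profiling: sample user stacks of the target process
  # at 997 Hz and print the most frequent stacks on exit
  dtrace -n 'profile-997 /pid == $target/ { @[ustack()] = count(); }' -p 8379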
