I have used prstat and prstat -ma recently to study performance issues, and I think I basically understood the differences in sampling and microstatus in Solaris 10. Therefore, I do not expect both to always show exactly the same number.
Today I was faced with a situation where 2 showed very different results, that I have problems with their interpretation and understanding of the result. The machine is a heavily loaded Solaris 10 8-processor with several large WebSphere processes and an Oracle database. The system practically stopped today for about 15 minutes (average load> 700). It was difficult for me to get any prstat information, but I was able to get some results from "prtstat 1 1" and "prtstat -m 1 1", released soon one after another.
Top lines of outputs:
prstat 1 1:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS / NLWP
8379 was 3208M 2773M cpu5 60 0 5:29:13 19% java / 145
7123 was 3159M 2756M run 59 0 5:26:45 7.7% java / 109
5855 app1 1132M 26M cpu2 60 0 0:01:01 7.7% java / 18
16503 monitor 494M 286M run 59 19 1:01:08 7.1% java / 106
7112 oracle 15G 15G run 59 0 0:00:10 4.5% oracle / 1
7124 oracle 15G 15G cpu3 60 0 0:00:10 4.5% oracle / 1
7087 app1 15G 15G run 58 0 0:00:09 4.0% oracle / 1
7155 oracle 96M 6336K cpu1 60 0 0:00:07 3.6% oracle / 1
...
Total: 495 processes, 4581 lwps, load averages: 74.79, 35.35, 23.8
prstat -m 1 1 (a few seconds later)
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS / NLWP
7087 app1 0.1 56 0.0 0.2 0.4 0.0 13 30 96 2 33 0 oracle / 1
7153 oracle 0.1 53 0.0 3.2 1.1 0.0 1.0 42 82 0 14 0 oracle / 1
7124 oracle 0.1 47 0.0 0.2 0.2 0.0 0.0 52 77 2 16 0 oracle / 1
7112 oracle 0.1 47 0.0 0.4 0.1 0.0 0.0 52 79 1 16 0 oracle / 1
7259 oracle 0.1 45 9.4 0.0 0.3 0.0 0.1 45 71 2 32 0 oracle / 1
7155 oracle 0.0 42 11 0.0 0.5 0.0 0.1 46 90 1 9 0 oracle / 1
7261 oracle 0.0 37 9.5 0.0 0.3 0.0 0.0 53 61 1 17 0 oracle / 1
7284 oracle 0.0 32 5.9 0.0 0.2 0.0 0.1 62 53 1 21 0 oracle / 1
...
Total: 497 processes, 4576 lwps, load averages: 88.86, 39.93, 25.51
It is very difficult for me to interpret the conclusion. It seems to me that prstat tells me that there is quite a bit of Java processing going along with some Oracle materials, as expected in a normal situation. prtstat -m shows that the machine is completely dominant in Oracle, consuming a huge amount of system time, and the overall processor is very overloaded (large numbers in LAT).
I tend to believe the conclusions of prstat -m because it comes close to what the system felt during this time. Completely sluggish, almost no user request from WebSphere, etc. But why does prstat show so much different numbers?
Any explanation for this is welcome!
CU, Joe