We have a Java EE application running with heap sizes of several gigabytes on our production servers. From time to time, one of the servers stops responding to requests entirely.
- When the problem occurs, the GC log shows the server spending most of its time in garbage collections that take 8 to 10 seconds each (normally they take less than 1 second).
- We never get OutOfMemoryErrors.
- The problem is not tied to a particular heap size: it occurs at different heap sizes, none of which is even close to the configured maximum.
- The problem is not tied to a particular interval, time of day, user load, or server node. It appears to be completely random.
- Heap dumps, even those taken from a server while it was showing the problem, did not reveal anything obviously wrong.
- Restarting the production servers every day appears to reduce the likelihood of the problem, but does not eliminate it.
- If we do not restart the servers every day, there is a high probability that the problem will occur on one of our 8 production servers within one to three days.
How would you begin to diagnose this?
Configuration
Our JAVA_OPTS are as follows:
-Xms8096m
-Xmx8096m
-XX:MaxPermSize=512M
-Dsun.rmi.dgc.client.gcInterval=1800000
-Dsun.rmi.dgc.server.gcInterval=1800000
-XX:NewSize=150M
-XX:+UseParNewGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xloggc:/path/to/gc.log
$ java -version
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
$ uname -a
Linux myhostname 2.6.18-274.3.1.el5
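One thing we could check first is what the running JVM actually reports for its arguments and memory pools, since the GC log fragment further down shows a ParNew capacity of 2487104K although -XX:NewSize is set to 150M. The following is only a minimal sketch using the standard java.lang.management API; the class name JvmConfigDump and its method names are hypothetical and not part of our application.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.List;

// Hypothetical helper (name is an assumption): reports the arguments and
// memory pool sizes of the JVM it runs in.
public class JvmConfigDump {

    /** Returns the arguments the current JVM was started with. */
    static List<String> inputArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** Returns one line per memory pool with used, committed and max sizes in KB. */
    static String describePools() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            if (u == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when the pool has no defined maximum.
            String max = u.getMax() < 0 ? "undefined" : (u.getMax() / 1024) + "K";
            sb.append(String.format("%s: used=%dK committed=%dK max=%s%n",
                    pool.getName(), u.getUsed() / 1024, u.getCommitted() / 1024, max));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("JVM arguments: " + inputArguments());
        System.out.print(describePools());
    }
}
```

To be meaningful, this would have to run inside the application server's JVM (for example from a diagnostic servlet, or by reading the same MXBeans over JMX), because a separately started process only reports its own settings.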
GC log
This is a fragment of the GC log from a time when the problem occurred:
111036.554: [GC 111036.555: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111036.555: [Tenured: 3629252K->3647971K(5526912K), 8.7565190 secs] 5840068K->3647971K(8014016K), 8.7567840 secs]
111055.691: [GC 111055.691: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111055.691: [Tenured: 3647971K->3667529K(5526912K), 8.7876340 secs] 5858787K->3667529K(8014016K), 8.7878690 secs]
111071.037: [GC 111071.037: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111071.037: [Tenured: 3667529K->3692057K(5526912K), 8.7581830 secs] 5878345K->3692057K(8014016K), 8.7584210 secs]
111088.407: [GC 111088.407: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111088.407: [Tenured: 3692057K->3638194K(5526912K), 10.7072790 secs] 5902873K->3638194K(8014016K), 10.7074960 secs]
111110.238: [GC 111110.238: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111110.238: [Tenured: 3638194K->3654614K(5526912K), 8.8021440 secs] 5849010K->3654614K(8014016K), 8.8023860 secs]
111128.115: [GC 111128.115: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111128.115: [Tenured: 3654614K->3668670K(5526912K), 8.8451510 secs] 5865430K->3668670K(8014016K), 8.8453600 secs]
111161.684: [GC 111161.684: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111161.684: [Tenured: 3668670K->3684080K(5526912K), 8.8156740 secs] 5879486K->3684080K(8014016K), 8.8159260 secs]
111186.669: [GC 111186.669: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111186.669: [Tenured: 3684080K->3639333K(5526912K), 10.6025350 secs] 5894896K->3639333K(8014016K), 10.6030040 secs]
111208.692: [GC 111208.692: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111208.692: [Tenured: 3639333K->3657993K(5526912K), 8.7967920 secs] 5850149K->3657993K(8014016K), 8.7970090 secs]
111235.486: [GC 111235.487: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111235.487: [Tenured: 3657993K->3676521K(5526912K), 8.8212340 secs] 5868809K->3676521K(8014016K), 8.8214930 secs]
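To make entries like these easier to compare, they could be reduced to a few numbers per collection: the total pause, the change in tenured occupancy, and the free tenured space before the collection next to the young generation occupancy. Below is a minimal sketch of such a parser. It assumes exactly the ParNew/Tenured entry layout shown above and would not match other collectors or log formats; the class name GcLogSketch and all of its member names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (name is an assumption): summarizes "ParNew + Tenured"
// entries in the layout of the log fragment above.
public class GcLogSketch {

    /** One parsed entry. Sizes are in KB, the pause is in seconds. */
    static final class Entry {
        final double timestamp;
        final long youngBefore;
        final long youngAfter;
        final long tenuredBefore;
        final long tenuredAfter;
        final long tenuredCapacity;
        final double pauseSecs;

        Entry(double timestamp, long youngBefore, long youngAfter,
              long tenuredBefore, long tenuredAfter, long tenuredCapacity,
              double pauseSecs) {
            this.timestamp = timestamp;
            this.youngBefore = youngBefore;
            this.youngAfter = youngAfter;
            this.tenuredBefore = tenuredBefore;
            this.tenuredAfter = tenuredAfter;
            this.tenuredCapacity = tenuredCapacity;
            this.pauseSecs = pauseSecs;
        }

        /** Free tenured space (KB) at the start of the collection. */
        long tenuredFreeBefore() { return tenuredCapacity - tenuredBefore; }

        /** Change in tenured occupancy (KB); positive means it grew. */
        long tenuredDelta() { return tenuredAfter - tenuredBefore; }
    }

    // Groups: 1 = timestamp, 2/3 = ParNew before/after, 4/5 = Tenured
    // before/after, 6 = Tenured capacity, 7 = total pause of the entry.
    private static final Pattern ENTRY = Pattern.compile(
        "([\\d.]+): \\[GC [\\d.]+: \\[ParNew: (\\d+)K->(\\d+)K\\(\\d+K\\), [\\d.]+ secs\\]"
        + "[\\d.]+: \\[Tenured: (\\d+)K->(\\d+)K\\((\\d+)K\\), [\\d.]+ secs\\] "
        + "\\d+K->\\d+K\\(\\d+K\\), ([\\d.]+) secs\\]");

    /** Extracts every entry that matches the assumed layout; other text is skipped. */
    static List<Entry> parse(String log) {
        List<Entry> entries = new ArrayList<Entry>();
        Matcher m = ENTRY.matcher(log);
        while (m.find()) {
            entries.add(new Entry(
                Double.parseDouble(m.group(1)),
                Long.parseLong(m.group(2)), Long.parseLong(m.group(3)),
                Long.parseLong(m.group(4)), Long.parseLong(m.group(5)),
                Long.parseLong(m.group(6)),
                Double.parseDouble(m.group(7))));
        }
        return entries;
    }

    public static void main(String[] args) {
        // The first two entries of the fragment above, copied verbatim.
        String sample =
            "111036.554: [GC 111036.555: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]"
            + "111036.555: [Tenured: 3629252K->3647971K(5526912K), 8.7565190 secs] "
            + "5840068K->3647971K(8014016K), 8.7567840 secs]\n"
            + "111055.691: [GC 111055.691: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]"
            + "111055.691: [Tenured: 3647971K->3667529K(5526912K), 8.7876340 secs] "
            + "5858787K->3667529K(8014016K), 8.7878690 secs]";
        for (Entry e : parse(sample)) {
            System.out.println(String.format(
                "t=%.3f pause=%.3fs young=%dK tenuredFreeBefore=%dK tenuredDelta=%dK",
                e.timestamp, e.pauseSecs, e.youngBefore,
                e.tenuredFreeBefore(), e.tenuredDelta()));
        }
    }
}
```

In every entry of the fragment, the ParNew occupancy is unchanged at 2210816K and the ParNew phase takes practically no time, so nearly all of each pause is spent in the Tenured collection. Whether the relation between free tenured space and young generation occupancy explains this is something we have not confirmed.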