JVM freezes periodically

I am trying to debug a misbehaving JVM. The process is a large VM (100 GB heap) running Sun VM 1.6u24 on CentOS 5, and it does routine backend work: database access, file I/O, and so on.

After restarting the process for a software version upgrade, we noticed that its throughput had dropped significantly. Most of the time, top reports that the Java process is fully using 2 cores. During this time the VM is completely unresponsive: no logs are written, and it does not respond to external tools such as jstack or kill -3. Once the VM recovers, the process continues as normal until the next hang.

strace shows that only 2 threads make system calls during these hangs. They are the VM threads "VM Thread" (21776) and "VM Periodic Task Thread" (21786). Presumably these 2 threads are the ones consuming the CPU time. Occasionally, application threads wake up and do their work. The rest of the time they appear to be waiting on various futexes. Incidentally, the first line of the normal phase is always a SIGSEGV.

[pid 21776] sched_yield() = 0
[pid 21776] sched_yield() = 0
[pid 21776] sched_yield( <unfinished ...>
[pid 21786] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 21776] <... sched_yield resumed> ) = 0
[pid 21786] futex(0x2aabac71ef28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 21776] sched_yield( <unfinished ...>
[pid 21786] <... futex resumed> ) = 0
[pid 21786] clock_gettime(CLOCK_MONOTONIC, {517080, 280918033}) = 0
[pid 21786] clock_gettime(CLOCK_REALTIME, {1369750039, 794028000}) = 0
[pid 21786] futex(0x2aabb81b94c4, FUTEX_WAIT_PRIVATE, 1, {0, 49923000} <unfinished ...>
[pid 21776] <... sched_yield resumed> ) = 0
[pid 21776] sched_yield() = 0
[pid 21776] sched_yield() = 0
[pid 21955] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
[pid 21955] rt_sigreturn(0x2b1cde2f54ad <unfinished ...>
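To see at a glance which threads are active during a hang, the strace output can be tallied per thread. The following is a minimal Python sketch, assuming one strace record per line in the `[pid N] ...` form shown above; the function name `tally_strace` is made up for illustration. Resumed calls (`<... futex resumed>`) are skipped so that each system call is counted once, at the point where it starts.

```python
import re
from collections import Counter

# Start of a syscall record, e.g. "[pid 21776] sched_yield() = 0".
CALL_RE = re.compile(r"^\[pid\s+(\d+)\]\s+([a-z_0-9]+)\(")
# Signal delivery, e.g. "[pid 21955] --- SIGSEGV (Segmentation fault) @ 0 (0) ---".
SIGNAL_RE = re.compile(r"^\[pid\s+(\d+)\]\s+---\s+(SIG[A-Z0-9]+)")


def tally_strace(lines):
    """Count syscall starts and signal deliveries per (thread id, name)."""
    calls = Counter()
    signals = Counter()
    for line in lines:
        line = line.strip()
        match = CALL_RE.match(line)
        if match:
            calls[(int(match.group(1)), match.group(2))] += 1
            continue
        match = SIGNAL_RE.match(line)
        if match:
            signals[(int(match.group(1)), match.group(2))] += 1
    return calls, signals


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python tally_strace.py < strace.out
    calls, signals = tally_strace(sys.stdin)
    for (tid, name), count in calls.most_common():
        print(f"{tid:>8} {name:<20} {count}")
    for (tid, name), count in signals.most_common():
        print(f"{tid:>8} {name:<20} {count} (signal)")
```

A thread that dominates the tally with `sched_yield` calls, as 21776 does in the excerpt, is spinning rather than blocking.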

The problem appears on two different servers. Rolling back our code version fixed it on only one of the two servers. No error messages were reported in the system logs, and another Java process on the affected machine behaves correctly.

The following output was obtained with gstack and shows 2 typical waiting application threads:

Thread 552 (Thread 0x4935f940 (LWP 21906)):
#0  0x00000030b040ae00 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b1cdd8548d6 in os::PlatformEvent::park(long) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#2  0x00002b1cdd92b230 in ObjectMonitor::wait(long, bool, Thread*) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#3  0x00002b1cdd928853 in ObjectSynchronizer::wait(Handle, long, Thread*) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#4  0x00002b1cdd69b716 in JVM_MonitorWait () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#5  0x00002b1cde193cc8 in ?? ()
#6  0x00002b1ce2552d90 in ?? ()
#7  0x00002b1cdd84fc23 in os::javaTimeMillis() () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#8  0x00002b1cde188a82 in ?? ()
#9  0x0000000000000000 in ?? ()

Thread 551 (Thread 0x49460940 (LWP 21907)):
#0  0x00000030b040ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b1cdd854d6f in Parker::park(bool, long) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#2  0x00002b1cdd98a1c8 in Unsafe_Park () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#3  0x00002b1cde193cc8 in ?? ()
#4  0x000000004945f798 in ?? ()
#5  0x00002b1cde188a82 in ?? ()
#6  0x0000000000000000 in ?? ()
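The LWP numbers in gstack and strace output are decimal, whereas HotSpot thread dumps list each thread's native ID in hexadecimal as `nid=0x...`. Since jstack does not respond during a hang, one option is to take a thread dump during a normal phase and match threads by that ID. Below is a small Python sketch of the conversion; the function name `thread_ids` is an illustrative choice, and the regular expression assumes the gstack header format shown above.

```python
import re

# gstack thread header, e.g. "Thread 552 (Thread 0x4935f940 (LWP 21906)):".
HEADER_RE = re.compile(
    r"Thread\s+(\d+)\s+\(Thread\s+0x[0-9a-fA-F]+\s+\(LWP\s+(\d+)\)\)"
)


def thread_ids(gstack_text):
    """Return (gstack thread number, LWP, LWP as hex nid) for each thread header."""
    return [
        (int(m.group(1)), int(m.group(2)), hex(int(m.group(2))))
        for m in HEADER_RE.finditer(gstack_text)
    ]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python thread_ids.py < gstack.out
    for number, lwp, nid in thread_ids(sys.stdin.read()):
        print(f"Thread {number}: LWP {lwp} -> nid={nid}")
```

For the two VM threads seen in strace, 21776 corresponds to `nid=0x5510` and 21786 to `nid=0x551a`.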

We looked into problems with NTPD, including the leap-second bugs, but the suggested workarounds did not help, and we do not use external NTPD servers anyway. Rebooting the machine alone did not help either. We have GC logging enabled, and it does not look like a GC problem, since the log contains no messages pointing to one. We are looking for any suggestions that could help with this; any help is greatly appreciated.

java jvm hang strace centos5
1 answer

Here are a few things I would look at:

  • If the JVM is not responding, use iostat and vmstat to check whether the system is thrashing. This can happen when you have overallocated memory; i.e., the system as a whole is using significantly more virtual memory than there is physical memory.

  • Turn on the JVM's GC logging and see whether there is a correlation between the JVM being unresponsive and the GC running.
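As a rough illustration of the first check, the Python sketch below scans captured vmstat output for samples with swap activity, which is a common sign of thrashing. It assumes the usual procps layout, in which a column header line starting with `r` names the `si` (swap-in) and `so` (swap-out) columns; the function names and the zero threshold are illustrative choices, not part of any tool.

```python
def parse_vmstat(text):
    """Parse vmstat output into a list of dicts keyed by column name."""
    rows = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The column header line looks like "r b swpd free buff cache si so ...".
        if fields[0] == "r" and "si" in fields and "so" in fields:
            header = fields
            continue
        # Data lines start with a number; the "procs ---memory---" banner does not.
        if header and len(fields) >= len(header) and fields[0].isdigit():
            rows.append({name: int(value) for name, value in zip(header, fields)})
    return rows


def swapping_samples(rows, threshold=0):
    """Return the samples whose swap-in or swap-out rate exceeds the threshold."""
    return [row for row in rows if row["si"] > threshold or row["so"] > threshold]


if __name__ == "__main__":
    import sys

    # Usage: vmstat 5 12 | python check_vmstat.py   (script name is hypothetical)
    samples = parse_vmstat(sys.stdin.read())
    busy = swapping_samples(samples)
    print(f"{len(busy)} of {len(samples)} samples show swap activity")
```

Sustained nonzero `si`/`so` values during a hang would point toward memory overcommitment; if they stay at zero, comparing the hang timestamps against the GC log is the next thing to try.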

