We are trying to debug a misbehaving Java VM. The process is a large VM (a heap of 100 GB) running Sun VM 1.6u24 on CentOS 5, and it does routine back-end work: database access, file I/O, and so on.
After restarting the process to deploy a new software version, we noticed that its throughput had dropped significantly. Most of the time, top reports that the Java process is fully using 2 cores. During this time the VM is completely unresponsive: no logs are written, and it does not respond to external tools such as jstack or kill -3. Once the VM recovers, the process continues as usual until the next freeze.
strace shows that during these hangs only 2 threads make system calls. These are the VM threads "VM Thread" (21776) and "VM Periodic Task Thread" (21786). Presumably these 2 threads are the ones consuming CPU time. Occasionally application threads wake up and do some work; the rest of the time they appear to be waiting on various futexes. Incidentally, the first line of each normal phase is always a SIGSEGV.
[pid 21776] sched_yield() = 0
[pid 21776] sched_yield() = 0
[pid 21776] sched_yield( <unfinished ...>
[pid 21786] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 21776] <... sched_yield resumed> ) = 0
[pid 21786] futex(0x2aabac71ef28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 21776] sched_yield( <unfinished ...>
[pid 21786] <... futex resumed> ) = 0
[pid 21786] clock_gettime(CLOCK_MONOTONIC, {517080, 280918033}) = 0
[pid 21786] clock_gettime(CLOCK_REALTIME, {1369750039, 794028000}) = 0
[pid 21786] futex(0x2aabb81b94c4, FUTEX_WAIT_PRIVATE, 1, {0, 49923000} <unfinished ...>
[pid 21776] <... sched_yield resumed> ) = 0
[pid 21776] sched_yield() = 0
[pid 21776] sched_yield() = 0
[pid 21955] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
[pid 21955] rt_sigreturn(0x2b1cde2f54ad <unfinished ...>
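To put numbers on which threads are active during a hang, the strace records can be tallied per thread ID. What follows is only a minimal sketch, assuming the strace output has been saved with one record per line in the `[pid N] ...` format; the class name `StraceTally` and the method `countByPid` are hypothetical names chosen for illustration. It also prints each ID in hex, since jstack reports native thread IDs as `nid=0x...`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StraceTally {
    // Matches the "[pid 21776]" prefix that strace puts on each record
    // when it follows multiple threads.
    private static final Pattern PID_PREFIX = Pattern.compile("^\\[pid\\s+(\\d+)\\]");

    /**
     * Counts strace records (system calls and signal deliveries) per thread ID.
     * "<... resumed>" lines are skipped so that an interrupted call is counted once;
     * lines without a pid prefix are ignored.
     */
    public static Map<Integer, Integer> countByPid(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<Integer, Integer>();
        for (String line : lines) {
            if (line.contains("resumed>")) {
                continue; // second half of a call already counted at "<unfinished ...>"
            }
            Matcher m = PID_PREFIX.matcher(line.trim());
            if (m.find()) {
                int pid = Integer.parseInt(m.group(1));
                Integer old = counts.get(pid);
                counts.put(pid, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A few records in the same format as the trace above.
        List<String> sample = Arrays.asList(
            "[pid 21776] sched_yield() = 0",
            "[pid 21776] sched_yield( <unfinished ...>",
            "[pid 21776] <... sched_yield resumed> ) = 0",
            "[pid 21786] clock_gettime(CLOCK_MONOTONIC, {517080, 280918033}) = 0",
            "[pid 21955] --- SIGSEGV (Segmentation fault) @ 0 (0) ---");
        for (Map.Entry<Integer, Integer> e : countByPid(sample).entrySet()) {
            // Print decimal and hex forms so the ID can be matched against jstack's nid.
            System.out.println(e.getKey() + " (nid=0x" + Integer.toHexString(e.getKey())
                + "): " + e.getValue());
        }
    }
}
```

Run against a full capture of a hang, a tally like this would show whether threads other than the two VM threads ever issue calls during the freeze.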
The problem appears on two different servers. Rolling back to our previous code version helped on only one of the two servers. No errors were reported in the system logs, and another Java process on the affected machine behaves correctly.
The following output was obtained with gstack and shows 2 typical waiting application threads:
Thread 552 (Thread 0x4935f940 (LWP 21906)):
We have looked into known problems with NTPD, including the leap-second bugs, but the suggested workarounds did not help, and we do not use external NTP servers. Rebooting the machine did not help either. We have GC logging enabled, and this does not look like a GC problem, since nothing related is logged. We are looking for any suggestions that could help here; any help is greatly appreciated.