Object.wait () exceeds the timeout

What can explain that the duration of Object.wait(timeout) exceeds the provided timeout value?

 long start = System.currentTimeMillis(); obj.wait(1000); long duration = System.currentTimeMillis() - start; // sometimes (very rarely) duration may exceed 1500 

Context: somewhere in the depths of very complex software there is a piece of code that makes such a wait and generates warning logs in case of excessive duration. In a high-traffic production environment, some logs report huge expectations (for example, 30 seconds). Therefore, I am trying to reproduce it, to understand what could happen, and how to fix / improve it.

+7
java
source share
3 answers

The “user time” or “wall clock-hours” time spent on a “wait (timeout)” call is usually the timeout value plus the time until the thread is rescheduled for execution and executed.

See the Javadoc for the Object.wait (long timeout) method:

Then the thread T [...] is again turned on for scheduling flows. Then it competes in the usual way with other threads for the right to synchronize with the object;

Thus, there is no guarantee for the "real-time" operation, it is rather a kind of "best attempt", depending on the current system load and, possibly, also on other lock dependencies in your application. Therefore, if the system is under heavy load or your application processes a lot of threads, the wait can take significantly longer.

PS
The @ nathan-hughes quote mentioned in his commentary on your question is probably the key suggestion in the Javadoc of the "wait" method: The specified amount of real time has elapsed, more or less .

PPS
Based on a change in your question with additional context information (“very complex software”, “high traffic”, “high expectations”): you should find all the uses of your obj object as a lock and determine how these interactions interact with each other .

It can get really complicated. Here it will try to outline a “simple” scenario of what might go wrong with only two simple threads, for example, this:

 // thread 1 synchronized (obj) { // wait 1000ms obj.wait(1000); } // check for overwait // thread 2, after, let say 500 ms synchronized (obj) { obj.notify(); } 

A simple script, everything is in order, the order of execution is approximately:

  • 0ms: T1 uses lock on 'obj'
  • 0ms: T1 is registered as a pending "obj" and is excluded from thread scheduling. Although an exception to thread scheduling, the lock on 'obj' is released again (!)
  • 500ms: T2 uses a lock on 'obj', notifies one thread that is waiting for notification (a thread is selected based on thread scheduling settings) and releases the lock on 'obj'
  • 500ms + X: T1 is turned on again for scheduling threads, it waits until it intercepts the lock on 'obj' (!) , After which it completes its lock and releases the 'OBJ lock.

These are just 2 simple threads and synchronized . Let it be more complicated with poorly written code . What if the 2nd stream is something like this:

 // bad variant of thread 2, after, let say 500 ms synchronized (obj) { obj.notify(); // do complex operation, taking more than few ms, // maybe a heavy SQL query/update... } 

In this case, despite the fact that T1 received a notification (or possibly a wait time), he needs to wait until he again restores the lock on “obj” , which is still held by T2 for as long as the complex operation is performed ( step 3 in the previous list)! This may take up to several seconds.

Even harder: we return to our initial simple streams T1 and T2, but add a third stream:

 // thread 3, after, let say also 500 ms synchronized (obj) { // do complex operation, taking more than few ms, // maybe a heavy SQL query/update... } 

The order of execution may become approximately:

  • 0ms: T1 uses lock on 'obj'
  • 0ms: T1 is registered as a pending "obj" and is excluded from thread scheduling. Although an exception to thread scheduling, the lock on 'obj' is released again (!)
  • 500ms: T2 uses a lock on 'obj', notifies one thread that is waiting for notification (a thread is selected based on thread scheduling settings) and releases the lock on 'obj'
  • 500ms + X: T2 is turned on again for scheduling threads, but does not get a lock on 'obj' because
  • 500ms + X: T3 is scheduled by the thread scheduler before T1, and it uses the lock on 'obj' (!) And starts performing a complicated operation. T1 can do nothing but wait!
  • 500ms + MANY: T3 * releases the lock on 'obj'.
  • 500ms + MANY: T1 overwrites the lock on 'obj' (!) , Then exits its synchronized block and releases the lock on 'obj'.

It only scratches the surface of what can happen in your "very sophisticated software," with "high traffic." Add more streams, possibly poorly encoded (for example, do too much in "synchronized" blocks), high traffic, and you can easily get the answers you mentioned.

OPTIONS
How to solve this ... depends on the purpose and complexity of your software, there is no simple plan. No more can be said based on the information available.

Perhaps re-analyzing the code with a pen and paper is enough, perhaps profiling can help you find locks, maybe you can get the necessary information about current locks through JMX or a dump stream (through a signal, jconsole, jcmd, jvisualvm), or by monitoring using Java Mission Control and Java Flight Recording (features available with ... JDK 7u40, I think).

You asked in a comment if Thread.sleep(timeout) would help: you can't say it without additional information. Maybe this will help. Or perhaps retentor locks or other lock options (see java.util.concurrent , java.util.concurrent.atomic , java.util.concurrent.locks packages) would be more appropriate. It depends on your code, your use case, and your version of Java.

If the GC is not a problem (see below) and you have analyzed the code, it “looks great”, and you think that the reason is high traffic, you can also think about the possibility of shifting the lock or locking the lock. See the Java 7 JVM options for more details (this article provides links to Java 8 JVM options).

GARBAGE COLLECTION
By the way, the “high traffic” was supposed to make me ask about this earlier: garbage collection, did you control it? If the setup is not configured / configured properly, the GC can also often lead to very significant pauses! (I have this week this case, 15-30 seconds for a full GC ...)

+3
source share

When the flow does indeed wake up after the waiting time or waking up, is not accurate. The sleeping paper has this note

It makes the current executable thread sleep (temporarily stop execution) for the specified number of milliseconds, taking into account the accuracy and accuracy of the system timers and schedulers. The thread does not lose ownership of any monitors.

Notification

also depends on these differences.

Thus, this is due to the accuracy of the timers in the system and the fact that other threads or processes execute at a time when the thread has the right to start again. A general rule - a timeout like this - is the minimum amount of time that will pass. Object.notify has an option that also takes nanoseconds, which gives you more precise control over the grain during the time that expires.

see Javadoc description about public final void wait(long timeout, int nanos)

+5
source share

Standby / Sleep should be used to ensure the logical order of program execution. There is no guarantee that the next thread will be scheduled, so the wait is often stored in a loop.

However, if you just want to check the reason for the timeout, try to find the thread that owns the lock and analyze this thread.

 ThreadMXBean bean = ManagementFactory.getThreadMXBean(); ThreadInfo[] ti = bean.getThreadInfo(bean.getAllThreadIds(), true, true); 

The ThreadInfo object contains LockInfo, you can get the lock hash code and check the thread that has the corresponding hash code to lock.

Also, if you can't really change the code, try turning on JMX during production (a restart may be required).

The following arguments should be added to the Java process.

 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=8989 

Port is the port for JMX. You can enable authentication if you need, then you will also have to provide the user / pass. Check here .

After you enable JMX, you can see the current state of the threads using Jvisualvm or jconsole . These tools also allow you to take a stream dump with the click of a button. Flow dump analysis can also give you a clue.

If you can enable remote debugging on the server using the JVM, you can debug a thread that contains a lock from your IDE. The following are arguments for the Java process that allow remote debugging

 -Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8888 

This will allow remote debugging on port 8888. If you are using eclipse, you can pause ( right-click suspend ) any scheduled thread in the debug perspective to see what it is doing now and debug it.

Good luck

+2
source share

All Articles