Getting a Warning Before a Full GC

In the context of a soft real-time system that should never pause for more than 200 ms, we are looking for a way to get advance warning before a full GC occurs. We realize we may not be able to avoid it altogether, but we would like to fail over to another node before the system stalls.

We have managed to come up with a scheme that gives us early warning ahead of an impending full GC, which could stall the system for several seconds (something we must avoid).

What we found relies on the CMS free list statistics, enabled with -XX:PrintFLSStatistics=1. This prints free list statistics to the GC log after every GC cycle, including young GCs, so the information is available at short intervals and shows up even more frequently during periods of high allocation rates. It probably costs a little in performance, but our working assumption is that we can afford it.
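
For reference, a minimal sketch of the kind of command line this implies — the heap sizes, paths, and application name here are illustrative, not our actual settings:

    java -Xms3g -Xmx3g -Xmn512m \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -XX:PrintFLSStatistics=1 \
         -Xloggc:/var/log/myapp/gc.log \
         -jar myapp.jar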

The output looks like this:

    Statistics for BinaryTreeDictionary:
    ------------------------------------
    Total Free Space: 382153298
    Max Chunk Size: 382064598
    Number of Blocks: 28
    Av. Block Size: 13648332
    Tree Height: 8

Of particular interest is the Max Chunk Size of 382064598 words. With 64-bit (8-byte) words, that is 382064598 × 8 = 3056516784 bytes, or just under 2915 MB. This number decreased very slowly, at a rate of roughly 1 MB per hour.

Our understanding is that as long as the largest free chunk is bigger than the young generation (assuming no humongous object allocations), every object promotion should succeed.
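
To make that criterion operational, one could tail the GC log, convert the reported chunk size from words to bytes, and compare it against the young generation size. A minimal sketch of such a watcher — the class name, alert logic, and polling interval are all our own illustration, not part of any existing tool:

    import java.io.RandomAccessFile;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Tails a GC log produced with -XX:PrintFLSStatistics=1 and raises an alert
    // once the largest free chunk can no longer hold the entire young generation.
    // Assumes the log file is not rotated while we watch it.
    public class MaxChunkWatcher {
        // "Max Chunk Size: 382064598" -- the value is in heap words (8 bytes on 64-bit).
        private static final Pattern MAX_CHUNK = Pattern.compile("Max\\s+Chunk\\s+Size:\\s+(\\d+)");
        private static final long BYTES_PER_WORD = 8;

        public static void main(String[] args) throws Exception {
            String gcLogPath = args[0];                    // e.g. /var/log/myapp/gc.log
            long youngGenBytes = Long.parseLong(args[1]);  // e.g. 536870912 for -Xmn512m
            long offset = 0;
            while (true) {
                try (RandomAccessFile log = new RandomAccessFile(gcLogPath, "r")) {
                    log.seek(offset);
                    String line;
                    while ((line = log.readLine()) != null) {
                        Matcher m = MAX_CHUNK.matcher(line);
                        if (m.find()) {
                            long maxChunkBytes = Long.parseLong(m.group(1)) * BYTES_PER_WORD;
                            if (maxChunkBytes < youngGenBytes) {
                                // A promotion of the whole young gen might not fit:
                                // time to fail over to another node.
                                System.err.printf("ALERT: max free chunk %d MB < young gen %d MB%n",
                                        maxChunkBytes >> 20, youngGenBytes >> 20);
                            }
                        }
                    }
                    offset = log.getFilePointer();
                }
                Thread.sleep(5000); // re-check every five seconds
            }
        }
    }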

We recently ran stress tests for several days and saw that CMS was able to keep the max chunk size above 94% of the total old generation space. The max free chunk size decreases at less than 1 MB/hour, which should be fine — by this measure we are nowhere near an eventual full GC, and the servers will be taken down for maintenance far more often than a full GC could occur.

In an earlier test, with a less memory-efficient version of the system, we were able to run the system for about 10 hours. During the first hour the max free chunk size dropped to 100 MB, where it stayed for more than 8 hours. In the last 40 minutes of the run, the max free chunk size decreased at a steady rate toward 0 until a full GC occurred — which was very encouraging, because for that workload we apparently got a 40-minute advance warning (from the moment the chunk size began its steady decline toward 0).

My question to you: assuming all of this reflects the long-term peak workload (the workload at any given point in production will only be lower), does this look like a valid approach? How reliably do you think we can count on the max free chunk size statistics from the GC log?

We are definitely open to suggestions, but we ask that they be limited to solutions available on HotSpot (no Azul for us, at least for now). Also, G1 by itself is no solution unless we can find a similar metric that gives advance warning before full GCs, or before any GCs that significantly exceed our SLA (and these can occasionally happen).


I am posting here the relevant excerpts from a very instructive and reassuring answer by Jon Masamitsu of Oracle, which I received on the HotSpot GC mailing list (hotspot-gc-use@openjdk.java.net) — he works on HotSpot, so this is really good news.

In any case, the question remains open for now (I can't credit myself for quoting an email :-)), so please do add your suggestions!

Formatting: quotes from my original post are set off from Jon's responses below.

We understand that as long as the largest free chunk is bigger than the young generation (assuming no humongous object allocations), every object promotion should succeed.

To a very large extent that is correct. There are circumstances under which an object promoted from the young generation into the CMS generation will require more space in the CMS generation than it did in the young generation. I don't think this happens to a significant extent.

The above is very encouraging, since we can definitely budget some spare memory to guard against the rare cases he describes, and it sounds like otherwise we will be fine.

<snip>

My question to you: assuming all of this reflects the long-term peak workload (the workload at any given point in production will only be lower), does this sound like a viable approach? How reliably do you think we can count on the max free chunk size statistics from the GC log?

The max free chunk size is accurate at the moment the GC prints it, but it can be stale by the time you read it and make your decisions.

For our workloads this metric is on a very slow downward trend, so a bit of staleness will not hurt us.

<snip>

We are certainly open to suggestions, but we ask that they be limited to solutions available on HotSpot (no Azul for us, at least for now). Also, G1 by itself is no solution unless we can come up with a similar metric that gives advance warning before full GCs, or before any GCs that significantly exceed our SLA (and these can occasionally happen).

I think using the max free chunk size as the metric is a good choice. It is very conservative (which sounds like what you want) and not subject to odd mixtures of object sizes.

For G1 I think you could use the number of completely free regions. I don't know whether it is currently printed in any of the logs, but it is probably a metric we maintain (or easily could). If the number of completely free regions decreases over time, it could be a sign that a full GC is coming.

Jon

Thanks, Jon!
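
As far as we know, HotSpot does not expose a counter of completely free G1 regions to applications, but as a rough in-process proxy (our own assumption, not something Jon suggested) one could watch the headroom of the "G1 Old Gen" memory pool over JMX and alert on a sustained downward trend:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    // Periodically logs how much spare capacity the G1 old generation pool reports.
    public class G1HeadroomProbe {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                    if ("G1 Old Gen".equals(pool.getName())) {
                        MemoryUsage usage = pool.getUsage();
                        // getMax() is -1 when undefined; with -Xmx set it is defined.
                        long headroomBytes = usage.getMax() - usage.getUsed();
                        System.out.printf("G1 Old Gen headroom: %d MB%n", headroomBytes >> 20);
                    }
                }
                Thread.sleep(10000); // sample every ten seconds
            }
        }
    }

Note that pool headroom is blind to fragmentation — it cannot distinguish scattered free space from completely free regions — so it is a weaker signal than the region count Jon describes.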


Divide and conquer!

Your system uses a lot of memory and has to stay highly responsive, so redesign your architecture to meet both demands.

Identify the real-time-critical tasks and their business rules, and create a dedicated Java process for just that part. Apply whatever non-standard programming practices it takes there — the idea is to not depend on the GC to keep memory clean. Think about it and be creative.

Then build the other layers and processes to handle everything else, and write the glue code that connects them all.

You could even cap the lifetime of the real-time process, or monitor its response time and kill and respawn it when it degrades — see the sketch below. With luck, though, you will rarely need to kill it.
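
A bare-bones sketch of that watchdog idea — the port, health-check protocol, and worker jar name are all placeholders of ours:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Pings the worker process and restarts it when the response time
    // exceeds the 200 ms soft real-time budget.
    public class WorkerWatchdog {
        public static void main(String[] args) throws Exception {
            Process worker = startWorker();
            while (true) {
                long start = System.nanoTime();
                boolean alive = ping("localhost", 9999, 200);
                long elapsedMs = (System.nanoTime() - start) / 1000000;
                if (!alive || elapsedMs > 200) {
                    worker.destroy();        // kill the stalled worker...
                    worker = startWorker();  // ...and spawn a fresh one
                }
                Thread.sleep(1000);
            }
        }

        private static boolean ping(String host, int port, int timeoutMs) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), timeoutMs);
                return true;
            } catch (IOException e) {
                return false;
            }
        }

        private static Process startWorker() throws IOException {
            return new ProcessBuilder("java", "-jar", "worker.jar").start();
        }
    }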

Good luck

