I will try to briefly explain the problem. I work in a supply chain domain where we deal with goods / products and SKUs.
Let's say the whole problem set is 1 million SKUs, which I process with an algorithm. The JVM heap size is currently set to 4 GB.
I can't process all the SKUs in one shot, as that would need far more memory. So I divide the problem into smaller batches, where each batch contains all the associated SKUs that need to be processed together.
Now I run several iterations to process the entire data set. If each batch holds approx. 5000 SKUs, I will have 200 iterations/loops. All the data for those 5000 SKUs is needed until the batch finishes processing. But once the next batch begins, the previous batch's data is no longer required and can therefore be garbage collected. A minimal sketch of this loop is shown below.
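For illustration, here is roughly what the loop looks like. The `BatchRunner`, `loadBatch`, `processBatch`, and `Sku` names are placeholders for my actual code, not the real implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchRunner {

    public void run(int totalSkus, int batchSize) {
        // ~200 iterations for 1M SKUs at 5000 per batch
        for (int offset = 0; offset < totalSkus; offset += batchSize) {
            List<Sku> batch = loadBatch(offset, batchSize); // all data for one batch
            processBatch(batch);                            // takes ~2-3 seconds
            // 'batch' goes out of scope here; its data becomes unreachable
            // and is eligible for collection before the next iteration starts
        }
    }

    private List<Sku> loadBatch(int offset, int size) {
        return new ArrayList<>(); // placeholder: fetch 'size' SKUs starting at 'offset'
    }

    private void processBatch(List<Sku> batch) {
        // placeholder: domain logic that needs the whole batch in memory
    }
}

class Sku { /* placeholder for the SKU domain object */ }
```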
Here is the problem: this pattern causes a specific GC performance issue. Each batch takes about 2-3 seconds to complete, and during that time the GC cannot release any objects, since all the data is needed until that particular batch finishes. As a result, the GC promotes all these objects to the old gen (when I look at the profiler, almost nothing stays in the young gen). So the old gen grows quickly, full GCs become necessary, and my program gets very slow. Is there a way to configure the GC for this case, or should I change my code to allocate memory differently?
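For concreteness, this is the direction I was considering on the GC side: a larger young generation and slower promotion, so an entire batch can die in the young gen instead of being tenured. The flag values below are untested guesses that would need validation against GC logs, and `app.jar` stands in for my application:

```
# -Xmn2g                      : larger young gen, so a whole batch fits there
# -XX:SurvivorRatio=4         : bigger survivor spaces, letting batch data
#                               age in the young gen instead of being promoted
# -XX:MaxTenuringThreshold=15 : objects must survive more minor GCs
#                               before being moved to the old gen
java -Xms4g -Xmx4g \
  -Xmn2g \
  -XX:SurvivorRatio=4 \
  -XX:MaxTenuringThreshold=15 \
  -jar app.jar
```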
PS - if each batch is very small, I do not see this problem. I believe this is because the GC can release objects quickly, since the batch finishes sooner, and therefore it never needs to move those objects to the old gen.