Background Information
I have a distributed processing application that performs data analysis. It is designed to process many data sets in parallel as they are updated in real time. The analysis is divided into analytical nodes. Each node takes raw data and processes it to produce derived data that other nodes can then consume. About 200 nodes are required to complete our full analysis of a single data set.
In the current design, each node runs on its own thread. Most of the time these threads are asleep. Each time the data is updated, they wake up one after another, like a waterfall, and then go back to sleep. Currently, the application runs on 40 data sets, each of which requires 200 nodes, for a total of 8,000 threads. When no data arrives, there is no load on the server. When data arrives at the busiest time, the server reaches about 25% CPU utilization. All this is within the design and production parameters of the project.
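To make the thread-per-node model concrete, here is a minimal sketch. It assumes each node blocks on a queue until an update arrives and then forwards its result to the next node. The names `ThreadPerNodeSketch`, `Node`, and `runChain`, the linear chain, and the "add 1.0" placeholder are illustrative assumptions, not our production code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ThreadPerNodeSketch {
    /** Hypothetical node: sleeps on its inbox, transforms a value, forwards it downstream. */
    static final class Node implements Runnable {
        final BlockingQueue<Double> inbox = new LinkedBlockingQueue<>();
        private final BlockingQueue<Double> downstream;

        Node(BlockingQueue<Double> downstream) {
            this.downstream = downstream;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    double value = inbox.take();  // thread is parked here until data arrives
                    downstream.put(value + 1.0);  // placeholder for the real analysis step
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // interrupted on shutdown: exit the loop
            }
        }
    }

    /** Builds a chain of nodeCount (>= 1) nodes, one thread each, and pushes one update through. */
    static double runChain(int nodeCount, double input) throws InterruptedException {
        BlockingQueue<Double> output = new LinkedBlockingQueue<>();
        List<Thread> threads = new ArrayList<>();
        BlockingQueue<Double> next = output;
        Node head = null;
        for (int i = 0; i < nodeCount; i++) {
            head = new Node(next);   // the newest node feeds the previously built one
            next = head.inbox;
            Thread t = new Thread(head, "node-" + i);
            t.setDaemon(true);
            t.start();
            threads.add(t);
        }
        head.inbox.put(input);          // one update wakes the nodes in sequence
        double result = output.take();  // wait for the last node to finish
        for (Thread t : threads) {
            t.interrupt();
        }
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        // 200 nodes, as for one data set; each node adds 1.0 to the value it receives.
        System.out.println(runChain(200, 0.0)); // prints 200.0
    }
}
```

Between updates every thread is parked inside `take()`, which matches the observation that the idle server shows no load.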
For the next step, we are scaling from 40 data sets to 200. Each set requires 200 nodes, which means a total of 40,000 nodes and therefore 40,000 threads. This exceeds our server's maximum PID limit, so I asked our server administrators to raise the limit. They did, and the application works, but they gave me some feedback about the number of threads. I do not deny that the number of threads is unusual, but at this stage of our project it is expected and justified.
I am planning a small design change to decouple the thread from the node. This will allow us to configure a single thread to run multiple nodes and so reduce the number of threads. For data sets that are not updated frequently, having one thread perform the updates for every node will have very little performance impact. For data sets that are updated hundreds of times per second, we can still configure each node to run on its own thread. I have no doubt that this design change will be made; it is only a matter of when. In the meantime, I would like as much information as possible about the implications of the current design.
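A rough sketch of the decoupling I have in mind, under simplifying assumptions: nodes become plain objects with no thread of their own, and each data set is given an executor whose thread count is a configuration value. `SharedThreadSketch`, `Node`, `DataSet`, and `sumWith` are hypothetical names, and the nodes are treated as independent here, whereas the real nodes depend on each other's output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedThreadSketch {
    /** Hypothetical node: analysis logic only, no thread attached. */
    interface Node {
        double process(double raw);
    }

    /** A data set owns its nodes; how many threads serve them is configuration. */
    static final class DataSet implements AutoCloseable {
        private final List<Node> nodes;
        private final ExecutorService executor;

        DataSet(List<Node> nodes, int threads) {
            this.nodes = nodes;
            this.executor = Executors.newFixedThreadPool(threads);
        }

        /** Runs one update through every node and returns the results in node order. */
        List<Double> update(double raw) throws InterruptedException, ExecutionException {
            List<Callable<Double>> tasks = new ArrayList<>();
            for (Node node : nodes) {
                tasks.add(() -> node.process(raw));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> future : executor.invokeAll(tasks)) { // blocks until all finish
                results.add(future.get());
            }
            return results;
        }

        @Override
        public void close() {
            executor.shutdown();
        }
    }

    /** Builds nodeCount placeholder nodes (node i returns raw + i) and sums one update. */
    static double sumWith(int threads, int nodeCount, double raw)
            throws InterruptedException, ExecutionException {
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            final int offset = i;
            nodes.add(value -> value + offset);
        }
        try (DataSet dataSet = new DataSet(nodes, threads)) {
            double sum = 0.0;
            for (double result : dataSet.update(raw)) {
                sum += result;
            }
            return sum;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumWith(1, 200, 1.0));   // quiet data set: one thread for all nodes
        System.out.println(sumWith(200, 200, 1.0)); // busy data set: one thread per node
    }
}
```

Both calls produce the same result; only the number of threads serving the 200 nodes changes, which is the knob I would like to tune per data set.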
Question
What is the cost of running more than 40,000 threads on one machine? How much performance do I lose by having the JVM and the Linux OS manage this many threads? Keep in mind that all of them sleep properly when there is no work, so I am only asking about the extra overhead and the problems caused by the sheer number of threads.
Please note: I know that I can reduce the number of threads, and I know that changing the design is a good idea. I will do this as soon as I can, but it must be balanced against other work and design considerations. I am asking this question to gather information so that I can make the right decision. Your thoughts and comments along these lines are greatly appreciated.
java performance optimization multithreading
Errick robertson