Environment: a Xeon processor with 16 cores, running Windows Server 2008 R2.
Before parallelization, the application (.NET / C#) loads one core to almost 100%. The obvious way to get a gain was to use the .NET 4 Task Parallel Library and speed the application up by a factor of X. Assume that the parallel part of the application is genuinely suitable for this: there is no locking between threads (no shared resources, and each parallel task is completely independent). But, to my regret, the gain is really low: the 16-thread version runs only about 2 times faster than the serial one.
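For reference, the kind of measurement described here can be sketched as follows. This is not the real application code: `ScalingProbe`, `CpuBoundWork` and `RunWithDegree` are hypothetical names, and the work item is a placeholder for one independent task that allocates nothing and shares nothing. Only `Parallel.For`, `ParallelOptions.MaxDegreeOfParallelism` and `Stopwatch` are actual .NET 4 APIs.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class ScalingProbe
{
    // Hypothetical stand-in for one independent task:
    // pure CPU work, no allocations, no shared state.
    public static double CpuBoundWork(int iterations)
    {
        double acc = 0;
        for (int i = 1; i <= iterations; i++)
        {
            acc += Math.Sqrt(i);
        }
        return acc;
    }

    // Runs taskCount independent tasks with at most `degree` of them
    // executing concurrently, and returns the wall-clock time taken.
    public static TimeSpan RunWithDegree(int degree, int taskCount, int iterations)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = degree };
        var sw = Stopwatch.StartNew();
        Parallel.For(0, taskCount, options, i => { CpuBoundWork(iterations); });
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Calling `ScalingProbe.RunWithDegree(n, 64, 50000000)` for n = 1, 8, 12, 15, 16 and comparing the elapsed times gives the speedup for each thread count. Note that `MaxDegreeOfParallelism` only caps concurrency; it does not pin threads to cores.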
Here is the first illustration: 16 threads on 16 cores.
This looks strange: every task is the same, yet the first 8 cores are loaded at almost the same level (~30%), while the remaining 8 show a gradually decreasing load.
So I tried different configurations, for example 8 threads on 16 cores.
It seems that all 8 threads run on 8 cores and are not moved from one core to another. Moreover, with 8 threads the average load per core is higher than with 16.
I did some research with a profiler: each thread behaves the same way as in the single-threaded case in terms of the percentage of time spent in different methods. The only difference is the absolute time, which grows as the number of threads increases (as if the performance of each core were decreasing).
So the main trends I can't explain are these: the more threads there are, the lower the average load per core, and the total CPU utilization peaks at 20-25%. In addition, each operation within a thread gets slower as the number of threads grows.
Any ideas to explain this weird thing?
UPD
After enabling Server GC, the picture changed significantly.
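To confirm that the process is really running under Server GC, the runtime can be asked directly. The sketch below assumes Server GC was switched on through the usual `<gcServer enabled="true"/>` element in the application's config file (the post does not say how it was enabled); `GcModeReport` is a hypothetical helper name, while `GCSettings.IsServerGC`, `GCSettings.LatencyMode` and `Environment.ProcessorCount` are standard .NET properties.

```csharp
using System;
using System.Runtime;

static class GcModeReport
{
    // Assumed app.config fragment that enables Server GC:
    //   <configuration>
    //     <runtime>
    //       <gcServer enabled="true"/>
    //     </runtime>
    //   </configuration>

    // Returns a one-line summary of the GC settings in effect for this process.
    public static string Describe()
    {
        return string.Format(
            "Server GC: {0}, latency mode: {1}, logical processors: {2}",
            GCSettings.IsServerGC,
            GCSettings.LatencyMode,
            Environment.ProcessorCount);
    }
}
```

Printing `GcModeReport.Describe()` at startup shows whether the setting was actually picked up.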
Illustration of 8 threads on 16 cores:
Illustration of 12 threads on 16 cores:
Illustration of 15 threads on 16 cores:
So it seems that processor usage now increases with the number of threads. The first thing that bothers me is that all cores are in use and the threads migrate from core to core, which I suspect hurts overall performance.
Secondly, the application reaches its maximum speed with 12 threads; 15 threads give the same result, and 16 threads are even slower.
What is the possible reason?