Multi-threaded application in a multi-core environment - strange load on the cores

Environment: Xeon processor with 16 cores, OS: Windows Server 2008 R2.

Before parallelization, the application (.NET/C#) loads one core at almost 100%. The obvious way to gain performance was to use the .NET 4 Task Parallel Library to speed the application up by some factor N. Assume the parallel part of the application really is suitable for this: there is no locking between the threads, there are no shared resources, and each parallel task is completely independent. But, to my regret, the gain is really small: the 16-thread version runs only about 2 times faster than the serial one.
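For reference, the parallel part is set up along these lines (a simplified sketch, not the real code; DoIndependentWork is a hypothetical stand-in for the actual per-task computation):

```csharp
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        const int taskCount = 16; // one task per core on the 16-core box

        var tasks = new Task[taskCount];
        for (int i = 0; i < taskCount; i++)
        {
            int id = i; // capture a private copy for the closure
            tasks[i] = Task.Factory.StartNew(
                () => DoIndependentWork(id),
                TaskCreationOptions.LongRunning); // dedicated thread per task
        }
        Task.WaitAll(tasks);
    }

    static void DoIndependentWork(int id)
    {
        // CPU-bound work with no shared state (per the description above).
    }
}
```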

Here is the first illustration, 16 threads on 16 cores:

[Screenshot: per-core CPU load with 16 threads on 16 cores]

This looks strange: every task is identical, yet the first 8 cores are loaded at roughly the same level (~30%), while the remaining 8 show a gradually decreasing load.

So I tried different configurations, for example 8 threads on 16 cores:

[Screenshot: per-core CPU load with 8 threads on 16 cores]

It looks like all 8 threads run on 8 cores and are not migrated from one core to another. Moreover, with 8 threads the average per-core load is higher than with 16.

I did some research with a profiler: each thread behaves exactly as in the single-threaded case in terms of the percentage of time spent in the various methods. The only difference is the absolute time, which grows as the number of threads increases (as if each core's performance were degrading).

So the main trends I cannot explain are: the more threads, the lower the average per-core load (total CPU utilization never exceeds 20-25%), and each operation inside a thread gets slower as the number of threads grows.

Any ideas that would explain this strange behavior?

UPD

After switching to server GC, the picture changed significantly.
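For reference, server GC is switched on in the application configuration file (assuming a standard app.config):

```xml
<configuration>
  <runtime>
    <!-- Use the server GC flavor instead of the default workstation GC -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```

At runtime, System.Runtime.GCSettings.IsServerGC reports which mode is actually active.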

Illustration of 8 threads on 16 cores:

[Screenshot: 8 threads on 16 cores after enabling server GC]

Illustration of 12 threads on 16 cores:

[Screenshot: 12 threads on 16 cores after enabling server GC]

Illustration of 15 threads on 16 cores:

[Screenshot: 15 threads on 16 cores after enabling server GC]

So it now seems that CPU usage scales with the number of threads. The first thing that bothers me is that all the cores are in use and the threads migrate from core to core, so overall performance is still not as good as it should be.

Second, the application is fastest with 12 threads; 15 threads give the same result, and 16 threads are actually slower.

What is the possible reason?
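In the meantime, one pragmatic workaround seems to be capping the degree of parallelism at the observed sweet spot instead of letting the scheduler use all 16 cores. A minimal sketch (DoIndependentWork is again a hypothetical stand-in for the real per-item computation):

```csharp
using System.Threading.Tasks;

class CappedParallelism
{
    static void Main()
    {
        // Cap the worker count at the empirically fastest value (12 here).
        var options = new ParallelOptions { MaxDegreeOfParallelism = 12 };

        Parallel.For(0, 1000, options, i =>
        {
            DoIndependentWork(i); // hypothetical per-item computation
        });
    }

    static void DoIndependentWork(int i) { /* CPU-bound, no shared state */ }
}
```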

+4
2 answers

The pattern you are seeing is often a sign of an I/O bottleneck. If your disks or network are already saturated delivering data to these computations (or carrying away the results), you can run on a million cores without any additional benefit. I would suggest using Sysinternals Process Explorer to check network and disk I/O and see whether there is a problem there, before trying to figure out why the code does not parallelize well.
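One quick way to check this from inside the application is to time the I/O and compute phases separately. A rough sketch, assuming the work splits into a read step and a compute step (ReadInput and Compute are hypothetical names):

```csharp
using System;
using System.Diagnostics;

class IoVsCompute
{
    static void Main()
    {
        var io = new Stopwatch();
        var cpu = new Stopwatch();

        for (int i = 0; i < 1000; i++)
        {
            io.Start();
            byte[] data = ReadInput(i);   // hypothetical I/O step
            io.Stop();

            cpu.Start();
            Compute(data);                // hypothetical CPU-bound step
            cpu.Stop();
        }

        // If the I/O total dominates, adding cores will not help.
        Console.WriteLine("I/O: {0} ms, compute: {1} ms",
                          io.ElapsedMilliseconds, cpu.ElapsedMilliseconds);
    }

    static byte[] ReadInput(int i) { return new byte[0]; } // placeholder
    static void Compute(byte[] data) { }                   // placeholder
}
```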

+2

Since it sounds like you have no synchronization inside your method, the problem is probably related to partitioning.

Given that you are using the TPL, work is handed out to the cores by a partitioner. However, the underlying source IEnumerable<T> is not thread safe, so access to it must go through a single lock. This often produces performance characteristics like the ones you show above when the actual work per element is small compared to the number of elements.

To address this, use the Partitioner class to first break your work items into chunks, and then iterate over those chunks in parallel. For more information, see the MSDN article "How to: Speed Up Small Loop Bodies".
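A minimal sketch of range partitioning with Partitioner.Create (Process is a hypothetical stand-in for the per-item work):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ChunkedLoop
{
    static void Main()
    {
        int[] items = new int[1000000]; // hypothetical work items

        // Range partitioning hands each worker a contiguous block of
        // indices, so threads no longer contend on a shared enumerator.
        var ranges = Partitioner.Create(0, items.Length);

        Parallel.ForEach(ranges, range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                Process(items[i]); // hypothetical per-item work
        });
    }

    static void Process(int item) { /* ... */ }
}
```

Each thread now takes a whole index range at a time, so the per-element synchronization cost is amortized over the chunk.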

+1
