Heavily multithreaded memory performance drops off after a certain number of cores

We are scalability-testing our software for the first time on a machine with more than 12 cores, and we are running into an unpleasant performance drop once we go past 12 threads. After spending a couple of days on this, we are stumped about what to try next.

The test system is a dual Opteron 6174 (2x12 cores) with 16 GB of memory, Windows Server 2008 R2.

Basically, performance peaks at 10 to 12 threads and then drops off a cliff, soon running at about the same speed as with 4 threads. The drop-off is quite steep, and by 16-20 threads it has bottomed out. We have tested both a single process with multiple threads and multiple processes each running a single thread; the results are almost the same. The processing is quite memory-intensive, with some disk I/O.

We are fairly sure this is a memory bottleneck, but we do not think it is a cache problem. The evidence we have:

  • CPU usage climbs steadily from 50% to 100% when scaling from 12 to 24 threads. If we had synchronization/lock-contention problems, we would have expected CPU utilization to top out below 100%.
  • Running the test while copying a large number of files in the background has very little effect on processing speed. We believe this rules out disk I/O as the bottleneck.
  • The commit charge is only about 4 GB, so we should be well below the threshold at which paging becomes a problem.
  • The most interesting data comes from AMD's CodeAnalyst tool. CodeAnalyst shows that the Windows kernel goes from taking about 6% of CPU time with 12 threads to 80-90% of CPU time with 24 threads. The vast majority of that time is spent in the ExAcquireResourceSharedLite (50%) and KeAcquireInStackQueuedSpinLockAtDpcLevel (46%) functions. Here are the highlights of how the kernel-mode numbers change going from 12 threads to 24 (each figure is the ratio of the 24-thread value to the 12-thread value):

    Instructions: 5.56x
    Clock cycles: 10.39x
    Memory operations: 4.58x
    Cache miss ratio: 0.25x (the absolute cache miss ratio is 0.1, 4 times lower than with 12 threads)
    Average cache miss latency: 8.92x
    Total cache miss time: 6.69x
    Mem bank load conflicts: 11.32x
    Mem bank store conflicts: 2.73x
    Mem forwarded: 7.42x

We thought this might be evidence of the problem described in this article; however, we found that pinning each worker thread/process to a specific core did not improve the results at all (if anything, performance got slightly worse).
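In case it clarifies what we tried: per-thread pinning on Windows boils down to something like the sketch below. This is illustrative only; the helper name is made up and this is not our exact code.

    #include <windows.h>

    /* illustrative helper: restrict the calling thread to a single core */
    static BOOL pin_current_thread_to_core (DWORD core)
    {
        DWORD_PTR mask = (DWORD_PTR)1 << core;
        /* SetThreadAffinityMask returns the previous mask, or 0 on failure */
        return SetThreadAffinityMask (GetCurrentThread (), mask) != 0;
    }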

So that's where we stand. Any ideas on the exact cause of this bottleneck, or on how we can avoid it?

+4
2 answers

I'm not sure I fully understand the problem, so I can't offer you a solution, but from what you've explained I can offer some alternative viewpoints that may help.

I program in C, so what works for me may not apply in your case.

Your processors have 12 MB of L3 and 6 MB of L2 cache, which is a lot, but in my experience it is rarely enough!

You are probably using rdtsc to time individual sections. When I use it, I keep a statistics structure to which I post measurements from different parts of the code. The mean, minimum, maximum and number of observations are the obvious things to track, but a standard deviation also helps you decide whether a large maximum value is worth investigating. The standard deviation only needs to be computed when it is read: until then it can be stored in its components (n, sum x, sum x^2). Unless you are timing very short sequences, you can omit the preceding serializing instruction. Make sure you measure the timing overhead, if only to be able to rule it out as insignificant.
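As an illustration, here is a minimal sketch of such a statistics structure (my own naming; it assumes MSVC's __rdtsc intrinsic from <intrin.h>):

    #include <stdio.h>
    #include <math.h>
    #include <intrin.h>              /* __rdtsc; assumes MSVC-style intrinsics */

    typedef struct {
        unsigned long long n;        /* number of observations */
        unsigned long long min, max;
        double sum_x, sum_x2;        /* stddev kept in its components */
    } TIMESTATS;

    static void stats_init (TIMESTATS *s)
    {
        s->n = 0;
        s->min = (unsigned long long)-1;
        s->max = 0;
        s->sum_x = s->sum_x2 = 0.0;
    }

    static void stats_add (TIMESTATS *s, unsigned long long cycles)
    {
        if (cycles < s->min) s->min = cycles;
        if (cycles > s->max) s->max = cycles;
        s->sum_x  += (double)cycles;
        s->sum_x2 += (double)cycles * (double)cycles;
        s->n++;
    }

    /* computed only when read: sqrt(E[x^2] - E[x]^2) */
    static double stats_stddev (const TIMESTATS *s)
    {
        double mean = s->sum_x / (double)s->n;
        return sqrt (s->sum_x2 / (double)s->n - mean * mean);
    }

    int main (void)
    {
        TIMESTATS ts;
        int i;
        stats_init (&ts);
        for (i = 0; i < 1000; i++) {
            unsigned long long t0 = __rdtsc ();
            /* ... section being measured ... */
            stats_add (&ts, __rdtsc () - t0);
        }
        printf ("n=%llu min=%llu max=%llu stddev=%.1f\n",
                ts.n, ts.min, ts.max, stats_stddev (&ts));
        return 0;
    }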

When I write multithreaded code, I try to make each core's/thread's job as "memory limited" as possible. By memory limited I mean not doing things that require unnecessary memory access. Unnecessary memory access usually means keeping the code as inline as possible and keeping OS calls to a minimum. To me the OS is a great unknown in terms of how much memory work a call will generate, so I try to minimize calls into it. In the same way, though usually with a smaller impact on performance, I try to avoid calling application functions: if they must be called, I'd rather they didn't call a lot of other things in turn.

In the same way, I minimize memory allocations: if I need several, I combine them into one and then subdivide that one large allocation into smaller pieces. This helps later allocations in that they have to walk through fewer blocks before finding a free one. I only block-initialize memory when it is absolutely necessary.
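A bare-bones sketch of what I mean, with names of my own invention:

    #include <stdlib.h>

    typedef struct {
        char  *base;    /* the one large allocation */
        size_t used;    /* bytes handed out so far */
        size_t size;    /* total size */
    } ARENA;

    static int arena_init (ARENA *a, size_t size)
    {
        a->base = (char *) malloc (size);
        a->used = 0;
        a->size = size;
        return a->base != NULL;
    }

    /* carve the next piece out of the big block; pieces are not freed
       individually, the whole arena is released at once */
    static void *arena_alloc (ARENA *a, size_t bytes)
    {
        void *p;
        bytes = (bytes + 15) & ~(size_t)15;    /* keep pieces 16-byte aligned */
        if (a->used + bytes > a->size) return NULL;
        p = a->base + a->used;
        a->used += bytes;
        return p;
    }

    static void arena_free (ARENA *a)
    {
        free (a->base);
        a->base = NULL;
    }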

I also try to reduce code size by inlining. When moving/setting small blocks of memory I prefer intrinsics based on rep movsb and rep stosb rather than calls to memcpy/memset, which are usually optimized for larger blocks and are not particularly compact themselves.
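On MSVC the corresponding intrinsics are __movsb and __stosb from <intrin.h>; a sketch, assuming that toolchain:

    #include <stddef.h>
    #include <intrin.h>

    /* compile to little more than rep movsb / rep stosb, with no call
       overhead and no large-block branching */
    static __forceinline void small_copy (void *dst, const void *src, size_t n)
    {
        __movsb ((unsigned char *)dst, (const unsigned char *)src, n);
    }

    static __forceinline void small_set (void *dst, unsigned char val, size_t n)
    {
        __stosb ((unsigned char *)dst, val, n);
    }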

I have only recently started using spinlocks, but I implement them so that they can be inlined (anything is better than calling the OS!). I suppose the OS alternative is critical sections, and although those are fast, local spinlocks are faster. Since critical sections perform extra processing, they keep the thread away from application work for longer. This is the implementation:

    inline void spinlock_init (SPINLOCK *slp)
    {
        slp->lock_part = 0;                 /* 0 = free, 1 = taken */
    }

    inline char spinlock_failed (SPINLOCK *slp)
    {
        /* __xchg: the compiler's atomic-exchange intrinsic (xchg instruction);
           returns the previous value, i.e. nonzero if the lock was taken */
        return (char) __xchg (&slp->lock_part, 1);
    }

Or a slightly more complex version (but not much more):

    inline char spinlock_failed (SPINLOCK *slp)
    {
        if (__xchg (&slp->lock_part, 1) == 1)
            return 1;                       /* already taken */
        slp->count_part = 1;                /* took it: start the nesting count */
        return 0;
    }

And the release:

    inline void spinlock_leave (SPINLOCK *slp)
    {
        slp->lock_part = 0;
    }

or

    inline void spinlock_leave (SPINLOCK *slp)
    {
        if (slp->count_part == 0)
            __breakpoint ();                /* releasing a lock we don't hold */
        if (--slp->count_part == 0)
            slp->lock_part = 0;
    }

The count part is something I brought over from embedded (and other) programming, where it is used for handling nested interrupts.
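For completeness, here is a self-contained variant of the same idea using MSVC's _InterlockedExchange in place of my compiler's __xchg (an assumption on my part; the SPINLOCK layout shown is mine):

    #include <intrin.h>
    #include <emmintrin.h>             /* _mm_pause */

    typedef struct {
        volatile long lock_part;       /* 0 = free, 1 = taken */
        long          count_part;      /* nesting count, as described above */
    } SPINLOCK;

    static __forceinline void spinlock_init (SPINLOCK *slp)
    {
        slp->lock_part = 0;
        slp->count_part = 0;
    }

    /* nonzero if the lock was already taken by someone else */
    static __forceinline char spinlock_failed (SPINLOCK *slp)
    {
        return (char) _InterlockedExchange (&slp->lock_part, 1);
    }

    static __forceinline void spinlock_enter (SPINLOCK *slp)
    {
        while (spinlock_failed (slp))
            _mm_pause ();              /* be polite to the sibling hardware thread */
        slp->count_part = 1;
    }

    static __forceinline void spinlock_leave (SPINLOCK *slp)
    {
        if (--slp->count_part == 0)
            slp->lock_part = 0;
    }

Usage is then simply spinlock_enter (&lk); ... spinlock_leave (&lk); around the protected region.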

I am also a big fan of IOCPs for their efficiency in handling events and I/O threads, but your description does not indicate whether your application could use them. In any case, you seem to be economical with them, which is good.

+2

To address your bullet points:

1) If you have 12 cores at 100% usage and 12 cores idle, your overall CPU usage is 50%. If your synchronization is spinlock-esque, your threads would still saturate their processors even when no useful work is being done.

2) skipped

3) I agree with your conclusion. Going forward, you should know that Perfmon has a counter, Process\Page Faults/sec, which can verify this (see the typeperf example after this list).

4) If you don't have private symbols for ntoskrnl, CodeAnalyst may not be able to tell you the correct function names in your profile; it can only point to the nearest preceding function for which it has symbols. Can you collect stack traces along with your profiles in CodeAnalyst? That could help you determine what operation your threads are performing that is driving the kernel usage.
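Regarding point 3: as an illustration, that counter can also be sampled from a console with typeperf (the process instance name below is a placeholder):

    typeperf "\Process(YourApp)\Page Faults/sec" -si 1 -sc 30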

Also, my former team at Microsoft has published a number of tools and guidance for performance analysis here, including how to take stack traces on CPU profiles.
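If it helps, a sampled CPU profile with kernel stacks is typically captured with xperf roughly as follows (an illustrative sketch from memory, not a prescription from that guidance):

    xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk Profile
    (run the workload)
    xperf -d trace.etl

The resulting trace.etl can then be opened in the viewer to see which call stacks account for the kernel time.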

+1
