First of all, be careful with Parallel: it does not protect you from thread-safety issues. In your code, you populate the list of results using a collection that is not thread-safe. In general, you want to avoid sharing any mutable state (read-only access, such as reading the input list here, is fine). If you really want to use Parallel.For or Parallel.ForEach for filtering and aggregation (and really, AsParallel is what you want in those cases), you should use the overload with thread-local state: you do the final aggregation of results in the localFinally delegate (note that it still runs on a worker thread, so you need to ensure thread safety; locking there is fine, since you only take the lock once per thread, not once per iteration).
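As a rough sketch of what that looks like (the data, the results list, and the `== 1` filter are placeholders taken from the question, not your actual code):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class LocalStateFilter
{
    static void Main()
    {
        int[] data = new int[1_000_000];
        data[42] = 1; // sample data

        var results = new List<int>();
        var resultsLock = new object();

        Parallel.For(
            0, data.Length,
            () => new List<int>(),               // localInit: one private list per worker
            (i, state, local) =>                 // body: touches no shared state
            {
                if (data[i] == 1) local.Add(data[i]);
                return local;
            },
            local =>                             // localFinally: runs once per worker
            {
                lock (resultsLock) results.AddRange(local);
            });

        Console.WriteLine(results.Count);
    }
}
```

The lock is taken once per worker thread rather than once per iteration, which keeps the synchronization cost negligible.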
Now, the obvious first thing to try is a profiler. So I did. The results are as follows:
- There are hardly any memory allocations in any of these solutions; they are completely dwarfed by the initial allocation of the test data, even for relatively small inputs (I tested with 1M, 10M and 100M integers).
- The work is dominated by the Parallel.For and Parallel.ForEach infrastructure itself, not by your code (the simple if (data[i] == 1) results.Add(data[i]) ).
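For reference, this is roughly the kind of micro-benchmark being discussed: the same trivial filter run single-threaded and via Parallel.For, timed with Stopwatch. The data size and the predicate are assumptions based on the text above, and a ConcurrentBag is used in the parallel version since List is not thread-safe:

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

class Bench
{
    static void Main()
    {
        var data = new int[10_000_000];
        var rng = new Random(0);
        for (int i = 0; i < data.Length; i++) data[i] = rng.Next(0, 100);

        // Single-threaded baseline.
        var sw = Stopwatch.StartNew();
        int count = 0;
        for (int i = 0; i < data.Length; i++)
            if (data[i] == 1) count++;
        sw.Stop();
        Console.WriteLine($"sequential:   {sw.ElapsedMilliseconds} ms ({count} hits)");

        // Parallel version with a thread-safe collection.
        var bag = new ConcurrentBag<int>();
        sw.Restart();
        Parallel.For(0, data.Length, i =>
        {
            if (data[i] == 1) bag.Add(data[i]);
        });
        sw.Stop();
        Console.WriteLine($"Parallel.For: {sw.ElapsedMilliseconds} ms ({bag.Count} hits)");
    }
}
```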
The first thing this tells us is that GC is probably not the culprit; in fact, it barely gets a chance to run. The second, more curious, finding is that in some cases the overhead of Parallel itself is the problem, and it seems random: sometimes the loop runs without a hitch, and sometimes it takes half a second. That usually points to the GC, but we have already ruled that out.
I tried using the overload without loop state, but that did not help. I tried limiting MaxDegreeOfParallelism, but that only hurt. Now, this code is obviously completely dominated by cache access; there is practically no CPU work and no I/O, which always favors a single-threaded solution. But even a MaxDegreeOfParallelism of 1 does not help; indeed, 2 seems to be the fastest on my system. More workers are useless, since cache access dominates again. It is still curious: I am testing on a server CPU, which has plenty of cache for all the data at once, and while we are not doing 100% sequential access (which would pretty much eliminate latency), the access pattern is fairly sequential. Regardless, the single-threaded solution gives us a memory-bandwidth baseline, and the parallelized case comes close to it when it works well (in parallel, I measured about 40% less execution time than single-threaded on a quad-core server CPU for an embarrassingly parallel problem; again, memory access is clearly the limit).
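For completeness, limiting the degree of parallelism looks like this (the values 1 and 2 are the ones tried above; the body is a placeholder):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class DopDemo
{
    static void Main()
    {
        int[] data = new int[1_000_000];
        int count = 0;

        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
        Parallel.For(0, data.Length, options, i =>
        {
            if (data[i] == 1) Interlocked.Increment(ref count);
        });

        Console.WriteLine(count);
    }
}
```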
So, it is time to check the reference source for Parallel.For. For a case like this, it simply creates ranges based on the number of workers, one range per worker. So it is not the ranges; there is no overhead there. The core simply launches a task that iterates over the given range. There are a few interesting bits in there; for example, a task will be "suspended" if it takes too long. However, that does not seem to fit the data very well: why would something like that cause random delays unrelated to data size? No matter how small the work is, and no matter how low MaxDegreeOfParallelism is, we get "random" slowdowns. It might be the problem, but I have no idea how to verify it.
The most interesting thing is that increasing the test data size does nothing about the anomaly: while it makes the "good" parallel runs much faster (oddly enough, even close to perfect efficiency in my tests), the "bad" runs are still just as bad. In fact, in some of my test runs they are absurdly bad (up to ten times slower than the "normal" loop).
So, let's look at the threads. I artificially bumped up the number of threads in the ThreadPool to make sure thread-pool growth is not the bottleneck (it should not be if everything works fine, but...). And here is the first surprise: while a "good" run just uses 4-8 threads, which makes sense, the "bad" runs expand over all the available threads in the pool, even if there are a hundred of them. Oops?
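Pre-growing the pool to rule out ramp-up heuristics is a one-liner; the 100 here is an arbitrary figure matching the text:

```csharp
using System;
using System.Threading;

class PoolWarmup
{
    static void Main()
    {
        // Raise the worker-thread floor so the pool does not have to grow lazily.
        ThreadPool.GetMinThreads(out int workerMin, out int ioMin);
        ThreadPool.SetMinThreads(Math.Max(workerMin, 100), ioMin);
        Console.WriteLine("min worker threads raised");
    }
}
```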
Time to dive into the source again. Parallel internally uses Task.RunSynchronously to run the root partitioned-work task, and waits on its result. When I look at Parallel Stacks, there are 97 threads executing the loop body, and only one that actually has RunSynchronously on its stack (as expected, that is the main thread). The rest are plain thread-pool threads. The task IDs tell a story too: there are thousands of individual tasks being created during the iteration. Obviously, something is very wrong here. Even if I remove the whole loop body, it still happens, so it is not some inlining oddity either.
Explicitly setting MaxDegreeOfParallelism reins this in a bit: the number of threads used no longer explodes, but the number of tasks still does. But we have already seen that the ranges are just the number of worker tasks being run, so why keep creating more and more tasks? Using the debugger confirms it: with a MaxDOP of four, there are only five ranges (there is some alignment that produces the fifth range). Interestingly, one of the completed ranges (how did the first one finish so far ahead of the rest?) has an index higher than the range it is iterating; that is because the "scheduler" assigns sections of a range in slices of up to 16 iterations.
The root task is self-replicating: instead of explicitly launching, say, four tasks to process the data, it waits for the scheduler to replicate the task to process more data. It is hard to read (we are talking about complex lock-free multi-threaded code), but it seems it always assigns work in slices much smaller than the partitioned ranges. In my testing, the maximum slice size was 16, a far cry from the millions of items I am iterating over. Sixteen iterations of a body like this take no time at all, which can run into lots of problems with the algorithm (the biggest being that the infrastructure burns more CPU than the actual iterator body). In some cases, cache effects could hurt performance even further (perhaps when the body's execution time varies a lot), but most of the time the access was sequential enough.
TL;DR:
Do not use Parallel.For and Parallel.ForEach if the work per iteration is very short (on the order of milliseconds). AsParallel, or even plain single-threaded iteration, will most likely be much faster.
A slightly longer explanation:
It seems that Parallel.For and Parallel.ForEach are designed for scenarios where the individual items you iterate over take a considerable amount of time to process (that is, lots of work per item, rather than tiny amounts of work over lots of items). They appear to perform badly when the iterator body is too short. If you are not doing substantial work in the iterator body, use AsParallel instead of Parallel.* . The cutoff seems to be somewhere around 150 ms per slice (about 10 ms per iteration). Otherwise, Parallel.* will spend a ton of time in its own code and hardly any running your iteration (in my case, the usual number was somewhere around 5-10% of the time spent in the body, which is embarrassingly bad).
Unfortunately, I did not find any warning about this on MSDN; even the samples there iterate over sizeable amounts of data, yet there is no hint of this terrible performance hit. Testing the same sample code on my machine, I found it is indeed often slower than single-threaded iteration, and at best only slightly faster (30-40% time savings while busy on four CPU cores is not very efficient).
EDIT:
Finally found a mention of this very problem on MSDN, and how to solve it - https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx . The idea is to use a custom partitioner and iterate over each chunk inside the body of Parallel.For (i.e., a loop within the Parallel.For loop). However, in most cases, using AsParallel is probably still the better choice; simple loop bodies usually mean some kind of map/reduce operation, and AsParallel and LINQ in general are great at that. For example, your sample code can be rewritten simply as:
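A sketch of the chunked-partitioner workaround the linked page describes: Partitioner.Create hands out coarse ranges, and an ordinary inner for loop iterates each range, so the Parallel infrastructure only pays its overhead once per chunk (the data and filter are placeholders):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class ChunkedFilter
{
    static void Main()
    {
        int[] data = new int[1_000_000];
        data[7] = 1;

        var results = new List<int>();
        var gate = new object();

        Parallel.ForEach(
            Partitioner.Create(0, data.Length),      // coarse [from, to) ranges
            range =>
            {
                var local = new List<int>();
                for (int i = range.Item1; i < range.Item2; i++)
                    if (data[i] == 1) local.Add(data[i]);
                lock (gate) results.AddRange(local); // once per chunk, not per item
            });

        Console.WriteLine(results.Count);
    }
}
```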
var result = testData.AsParallel().Where(i => i == 1).ToList();
The only time using AsParallel is a bad idea is the same as with all other LINQ: when your loop body has side effects. Some may be tolerable, but it is safer to avoid them altogether.
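A quick illustration of that caveat (the `seen` list is a made-up example, not from the question): mutating shared state from a PLINQ predicate is a data race, so keep the query pure instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SideEffects
{
    static void Main()
    {
        var data = Enumerable.Range(0, 1_000_000).ToArray();

        var seen = new List<int>();
        // BAD: List<int>.Add is not thread-safe; calling it from a parallel
        // predicate can corrupt 'seen' or throw.
        // var bad = data.AsParallel().Where(i => { seen.Add(i); return i == 1; }).ToList();

        // GOOD: a pure predicate; let PLINQ materialize the result itself.
        var good = data.AsParallel().Where(i => i == 1).ToList();
        Console.WriteLine(good.Count);
    }
}
```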