First of all, be careful with Parallel: it does not protect you from thread-safety issues. In your code, you populate the list of results using a collection that is not thread-safe. In general, you want to avoid sharing any mutable state (read-only access, such as reading the input list here, is fine). If you really want to use Parallel.For or Parallel.ForEach for filtering and aggregation (and really, AsParallel is what you want in those cases), you should use the overload with thread-local state: you do the final aggregation of results in the localFinally delegate (note that it still runs on a worker thread, so you need to ensure thread safety; locking there is fine, since you only take the lock once per thread, not once per iteration).
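As a rough sketch of what that looks like (the data, the results list, and the `== 1` filter are placeholders taken from the question, not your actual code):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class LocalStateFilter
{
    static void Main()
    {
        int[] data = new int[1_000_000];
        data[42] = 1; // sample data

        var results = new List<int>();
        var resultsLock = new object();

        Parallel.For(
            0, data.Length,
            () => new List<int>(),               // localInit: one private list per worker
            (i, state, local) =>                 // body: touches no shared state
            {
                if (data[i] == 1) local.Add(data[i]);
                return local;
            },
            local =>                             // localFinally: runs once per worker
            {
                lock (resultsLock) results.AddRange(local);
            });

        Console.WriteLine(results.Count);
    }
}
```

The lock is taken once per worker thread rather than once per iteration, which keeps the synchronization cost negligible.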
Now, the obvious first thing to try is a profiler. So I did. The results are as follows:
- There are hardly any memory allocations in any of these solutions; they are completely dwarfed by the initial allocation of the test data, even for relatively small inputs (I tested with 1M, 10M and 100M integers).
- The work is dominated by the Parallel.For and Parallel.ForEach infrastructure itself, not by your code (the simple if (data[i] == 1) results.Add(data[i]) ).
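For reference, this is roughly the kind of micro-benchmark being discussed: the same trivial filter run single-threaded and via Parallel.For, timed with Stopwatch. The data size and the predicate are assumptions based on the text above, and a ConcurrentBag is used in the parallel version since List is not thread-safe:

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

class Bench
{
    static void Main()
    {
        var data = new int[10_000_000];
        var rng = new Random(0);
        for (int i = 0; i < data.Length; i++) data[i] = rng.Next(0, 100);

        // Single-threaded baseline.
        var sw = Stopwatch.StartNew();
        int count = 0;
        for (int i = 0; i < data.Length; i++)
            if (data[i] == 1) count++;
        sw.Stop();
        Console.WriteLine($"sequential:   {sw.ElapsedMilliseconds} ms ({count} hits)");

        // Parallel version with a thread-safe collection.
        var bag = new ConcurrentBag<int>();
        sw.Restart();
        Parallel.For(0, data.Length, i =>
        {
            if (data[i] == 1) bag.Add(data[i]);
        });
        sw.Stop();
        Console.WriteLine($"Parallel.For: {sw.ElapsedMilliseconds} ms ({bag.Count} hits)");
    }
}
```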
The first thing this tells us is that GC is probably not the culprit; in fact, it barely gets a chance to run. The second, more curious, finding is that in some cases the overhead of Parallel itself is the problem, and it seems random: sometimes the loop runs without a hitch, and sometimes it takes half a second. That usually points to the GC, but we have already ruled that out.
I tried using the overload without loop state, but that did not help. I tried limiting MaxDegreeOfParallelism, but that only hurt. Now, this code is obviously completely dominated by cache access; there is practically no CPU work and no I/O, which always favors a single-threaded solution. But even a MaxDegreeOfParallelism of 1 does not help; indeed, 2 seems to be the fastest on my system. More workers are useless, since cache access dominates again. It is still curious: I am testing on a server CPU, which has plenty of cache for all the data at once, and while we are not doing 100% sequential access (which would pretty much eliminate latency), the access pattern is fairly sequential. Regardless, the single-threaded solution gives us a memory-bandwidth baseline, and the parallelized case comes close to it when it works well (in parallel, I measured about 40% less execution time than single-threaded on a quad-core server CPU for an embarrassingly parallel problem; again, memory access is clearly the limit).
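For completeness, limiting the degree of parallelism looks like this (the values 1 and 2 are the ones tried above; the body is a placeholder):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class DopDemo
{
    static void Main()
    {
        int[] data = new int[1_000_000];
        int count = 0;

        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
        Parallel.For(0, data.Length, options, i =>
        {
            if (data[i] == 1) Interlocked.Increment(ref count);
        });

        Console.WriteLine(count);
    }
}
```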
So, it is time to check the reference source for Parallel.For. For a case like this, it simply creates ranges based on the number of workers, one range per worker. So it is not the ranges; there is no overhead there. The core simply launches a task that iterates over the given range. There are a few interesting bits in there; for example, a task will be "suspended" if it takes too long. However, that does not seem to fit the data very well: why would something like that cause random delays unrelated to data size? No matter how small the work is, and no matter how low MaxDegreeOfParallelism is, we get "random" slowdowns. It might be the problem, but I have no idea how to verify it.
The most interesting thing is that increasing the test data size does nothing about the anomaly: while it makes the "good" parallel runs much faster (oddly enough, even close to perfect efficiency in my tests), the "bad" runs are still just as bad. In fact, in some of my test runs they are absurdly bad (up to ten times slower than the "normal" loop).
So, let's look at the threads. I artificially bumped up the number of threads in the ThreadPool to make sure thread-pool growth is not the bottleneck (it should not be if everything works fine, but...). And here is the first surprise: while a "good" run just uses 4-8 threads, which makes sense, the "bad" runs expand over all the available threads in the pool, even if there are a hundred of them. Oops?
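Pre-growing the pool to rule out ramp-up heuristics is a one-liner; the 100 here is an arbitrary figure matching the text:

```csharp
using System;
using System.Threading;

class PoolWarmup
{
    static void Main()
    {
        // Raise the worker-thread floor so the pool does not have to grow lazily.
        ThreadPool.GetMinThreads(out int workerMin, out int ioMin);
        ThreadPool.SetMinThreads(Math.Max(workerMin, 100), ioMin);
        Console.WriteLine("min worker threads raised");
    }
}
```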
Time to dive into the source again. Parallel internally uses Task.RunSynchronously to run the root partitioned-work task, and waits on its result. When I look at Parallel Stacks, there are 97 threads executing the loop body, and only one that actually has RunSynchronously on its stack (as expected, that is the main thread). The rest are plain thread-pool threads. The task IDs tell a story too: there are thousands of individual tasks being created during the iteration. Obviously, something is very wrong here. Even if I remove the whole loop body, it still happens, so it is not some inlining oddity either.
Explicitly setting MaxDegreeOfParallelism reins this in a bit: the number of threads used no longer explodes, but the number of tasks still does. But we have already seen that the ranges are just the number of worker tasks being run, so why keep creating more and more tasks? Using the debugger confirms it: with a MaxDOP of four, there are only five ranges (there is some alignment that produces the fifth range). Interestingly, one of the completed ranges (how did the first one finish so far ahead of the rest?) has an index higher than the range it is iterating; that is because the "scheduler" assigns sections of a range in slices of up to 16 iterations.
The root task is self-replicating: instead of explicitly launching, say, four tasks to process the data, it waits for the scheduler to replicate the task to process more data. It is hard to read (we are talking about complex lock-free multi-threaded code), but it seems it always assigns work in slices much smaller than the partitioned ranges. In my testing, the maximum slice size was 16, a far cry from the millions of items I am iterating over. Sixteen iterations of a body like this take no time at all, which can run into lots of problems with the algorithm (the biggest being that the infrastructure burns more CPU than the actual iterator body). In some cases, cache effects could hurt performance even further (perhaps when the body's execution time varies a lot), but most of the time the access was sequential enough.
TL;DR:
Do not use Parallel.For and Parallel.ForEach if the work per iteration is very short (on the order of milliseconds). AsParallel, or even plain single-threaded iteration, will most likely be much faster.
A slightly longer explanation:
It seems that Parallel.For and Parallel.ForEach are designed for scenarios where the individual items you iterate over take a considerable amount of time to process (that is, lots of work per item, rather than tiny amounts of work over lots of items). They appear to perform badly when the iterator body is too short. If you are not doing substantial work in the iterator body, use AsParallel instead of Parallel.* . The cutoff seems to be somewhere around 150 ms per slice (about 10 ms per iteration). Otherwise, Parallel.* will spend a ton of time in its own code and hardly any running your iteration (in my case, the usual number was somewhere around 5-10% of the time spent in the body, which is embarrassingly bad).
Unfortunately, I did not find any warning about this on MSDN; even the samples there iterate over sizeable amounts of data, yet there is no hint of this terrible performance hit. Testing the same sample code on my machine, I found it is indeed often slower than single-threaded iteration, and at best only slightly faster (30-40% time savings while busy on four CPU cores is not very efficient).
EDIT:
Finally found a mention of this very problem on MSDN, and how to solve it - https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx . The idea is to use a custom partitioner and iterate over each chunk inside the body of Parallel.For (i.e., a loop within the Parallel.For loop). However, in most cases, using AsParallel is probably still the better choice; simple loop bodies usually mean some kind of map/reduce operation, and AsParallel and LINQ in general are great at that. For example, your sample code can be rewritten simply as:
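A sketch of the chunked-partitioner workaround the linked page describes: Partitioner.Create hands out coarse ranges, and an ordinary inner for loop iterates each range, so the Parallel infrastructure only pays its overhead once per chunk (the data and filter are placeholders):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class ChunkedFilter
{
    static void Main()
    {
        int[] data = new int[1_000_000];
        data[7] = 1;

        var results = new List<int>();
        var gate = new object();

        Parallel.ForEach(
            Partitioner.Create(0, data.Length),      // coarse [from, to) ranges
            range =>
            {
                var local = new List<int>();
                for (int i = range.Item1; i < range.Item2; i++)
                    if (data[i] == 1) local.Add(data[i]);
                lock (gate) results.AddRange(local); // once per chunk, not per item
            });

        Console.WriteLine(results.Count);
    }
}
```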
var result = testData.AsParallel().Where(i => i == 1).ToList();
The only time using AsParallel is a bad idea is the same as with all other LINQ: when your loop body has side effects. Some may be tolerable, but it is safer to avoid them altogether.
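A quick illustration of that caveat (the `seen` list is a made-up example, not from the question): mutating shared state from a PLINQ predicate is a data race, so keep the query pure instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SideEffects
{
    static void Main()
    {
        var data = Enumerable.Range(0, 1_000_000).ToArray();

        var seen = new List<int>();
        // BAD: List<int>.Add is not thread-safe; calling it from a parallel
        // predicate can corrupt 'seen' or throw.
        // var bad = data.AsParallel().Where(i => { seen.Add(i); return i == 1; }).ToList();

        // GOOD: a pure predicate; let PLINQ materialize the result itself.
        var good = data.AsParallel().Where(i => i == 1).ToList();
        Console.WriteLine(good.Count);
    }
}
```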