Parallel.ForEach can cause an "Out Of Memory" exception when working with an enumerable of large objects

I am trying to migrate a database in which images are stored as blobs to one where each record instead points to a file on the hard drive. I tried to use Parallel.ForEach to speed up the process, using this method to query out the data.

However, I noticed that I am getting an OutOfMemory exception. I know that Parallel.ForEach may grab a batch of enumerated elements to mitigate the overhead, if there is any, of spacing out the queries (so your source is more likely to have the next record cached in memory if you do a bunch of queries at once instead of spacing them out). The issue is that one of the records I return is a 1-4 MB byte array, and the caching causes the entire address space to be used up (the program must run in x86 mode, since the target platform will be a 32-bit machine).

Is there any way to disable the caching, or make it smaller, for the TPL?

Here is an example program that reproduces the problem. It must be compiled in x86 mode to show the issue; if it takes a long time or does not happen on your machine, increase the size of the array (I found that 1 << 20 takes about 30 seconds on my machine, while 4 << 20 failed almost instantly).

    using System.Collections.Generic;
    using System.Threading.Tasks;

    class Program
    {
        static void Main(string[] args)
        {
            Parallel.ForEach(CreateData(), (data) =>
            {
                data[0] = 1;
            });
        }

        static IEnumerable<byte[]> CreateData()
        {
            while (true)
            {
                yield return new byte[1 << 20]; // 1 MB array
            }
        }
    }
4 answers

The default options for Parallel.ForEach only work well when the task is CPU-bound and scales linearly. When the task is CPU-bound, everything works perfectly. If you have a quad-core processor and no other processes running, then Parallel.ForEach uses all four processors. If you have a quad-core processor and some other process on your computer is using one full CPU, then Parallel.ForEach uses roughly three processors.

But if the task is not CPU-bound, then Parallel.ForEach keeps starting tasks, trying hard to keep all CPUs busy. Yet no matter how many tasks are running in parallel, there is always more unused CPU horsepower, and so it keeps creating tasks.

How can you tell if your task is CPU-bound? Hopefully just by inspecting it. If you are factoring prime numbers, it is obvious. But other cases are not as obvious. The empirical way to tell if your task is CPU-bound is to limit the maximum degree of parallelism with ParallelOptions.MaxDegreeOfParallelism and observe how your program behaves. If your task is CPU-bound, then you should see a pattern like this on a quad-core system (a small test harness follows the list):

  • ParallelOptions.MaxDegreeOfParallelism = 1: use one full CPU, or 25% CPU utilization.
  • ParallelOptions.MaxDegreeOfParallelism = 2: use two CPUs, or 50% CPU utilization.
  • ParallelOptions.MaxDegreeOfParallelism = 4: use all CPUs, or 100% CPU utilization.
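
A minimal sketch of such an experiment (not from the original answer; the loop body is a stand-in for real CPU-bound work). Run it and watch overall CPU utilization in Task Manager while each setting executes:

    using System;
    using System.Diagnostics;
    using System.Linq;
    using System.Threading.Tasks;

    class DopExperiment
    {
        static void Main(string[] args)
        {
            foreach (int dop in new[] { 1, 2, 4 })
            {
                var sw = Stopwatch.StartNew();
                Parallel.ForEach(
                    Enumerable.Range(0, 8),
                    new ParallelOptions { MaxDegreeOfParallelism = dop },
                    n =>
                    {
                        // Stand-in CPU-bound work; replace with your real loop body.
                        double x = 0;
                        for (int i = 0; i < 100000000; i++) x += Math.Sqrt(i);
                    });
                Console.WriteLine("DOP {0}: {1} ms", dop, sw.ElapsedMilliseconds);
            }
        }
    }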

If it behaves like this, then you can use the default Parallel.ForEach options and get good results. Linear CPU utilization means good task scheduling.

But if I run the sample application on my Intel i7, I get about 20% CPU utilization no matter what maximum degree of parallelism I set. Why is that? So much memory is being allocated that the garbage collector is blocking threads. The application is resource-bound, and the resource is memory.

Likewise, an I/O-bound task that performs long-running queries against a database server will also never be able to effectively use all the CPU resources available on the local computer. And in cases like these, the task scheduler is unable to "know when to stop" starting new tasks.

If your task is not CPU-bound, or the CPU utilization does not scale linearly with the maximum degree of parallelism, then you should advise Parallel.ForEach not to start too many tasks at once. The simplest way is to specify a number that permits some parallelism for overlapping I/O-bound tasks, but not so much that you overwhelm the local computer's demand for resources or overtax any remote servers. Trial and error is involved to get the best results:

    static void Main(string[] args)
    {
        Parallel.ForEach(CreateData(),
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            (data) =>
            {
                data[0] = 1;
            });
    }

So, while what Rick has suggested is definitely an important point, another thing I think is missing is the discussion of partitioning.

Parallel::ForEach will use a default Partitioner<T> implementation which, for an IEnumerable<T> with no known length, will use a chunk partitioning strategy. What this means is that each worker thread that Parallel::ForEach is going to use to work on the data set will read some number of elements from the IEnumerable<T>, which will then only be processed by that thread (ignoring work stealing for now). It does this to save the expense of constantly going back to the source, allocating some new work, and scheduling it for another worker thread. So, usually, this is a good thing. However, in your specific scenario, imagine you are on a quad core and have set MaxDegreeOfParallelism to 4 threads for your work, and now each of those pulls a chunk of 100 elements from your IEnumerable<T>. Well, that's 100-400 megabytes right there just for that particular worker thread, right?

So how do you solve this? Easy: you write a custom Partitioner<T> implementation. Now, chunking is still useful in your case, so you probably don't want to go with a single-element partitioning strategy, because then you would introduce overhead with all the task coordination necessary for that. Instead, I would write a configurable version that you can tune via an app setting until you find the optimal balance for your workload. The good news is that, while writing such an implementation is pretty straightforward, you don't actually have to write it yourself, because the PFX team already did it and put it into the parallel programming samples project.
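
As an aside (not part of the original answer): if you can target .NET 4.5 or later, the framework itself ships an overload of Partitioner.Create that disables chunk buffering entirely, which addresses exactly the caching the question asks about:

    using System.Collections.Generic;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    class NoBufferingExample
    {
        static void Main(string[] args)
        {
            // NoBuffering hands each worker exactly one element at a time,
            // so the large arrays are never buffered in chunks.
            Parallel.ForEach(
                Partitioner.Create(CreateData(), EnumerablePartitionerOptions.NoBuffering),
                (data) => { data[0] = 1; });
        }

        // Same source as the question's sample program.
        static IEnumerable<byte[]> CreateData()
        {
            while (true)
                yield return new byte[1 << 20];
        }
    }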


This problem has everything to do with partitioners, not with the degree of parallelism. The solution is to implement a custom data partitioner.

If the data set is large, it seems the mono implementation of the TPL is guaranteed to run out of memory. This happened to me recently (essentially I was running the loop above and found that the memory increased linearly until it gave me an OOM exception).

After tracking down the problem, I found that by default mono divides up the enumerator using an EnumerablePartitioner class. This class has the behavior that every time it hands data out to a task, it "chunks" the data by an ever-increasing (and unchangeable) factor of 2. So the first time a task asks for data, it gets a chunk of size 1, the next time size 2*1 = 2, the next time 2*2 = 4, then 2*4 = 8, etc. The result is that the amount of data handed to a task, and therefore held in memory simultaneously, grows with the length of the task, and if a lot of data is being processed, an out-of-memory exception inevitably occurs.
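
One way to see this chunking behavior for yourself (a diagnostic sketch, not from the original answer) is to wrap the source so that every pull is logged along with the pulling thread; consecutive runs of the same thread id reveal the chunk sizes:

    using System;
    using System.Collections.Generic;
    using System.Threading;

    static class ChunkTrace
    {
        // Logs each element as the partitioner pulls it from the source.
        public static IEnumerable<T> Traced<T>(IEnumerable<T> source)
        {
            int count = 0;
            foreach (T item in source)
            {
                Console.WriteLine("item {0} pulled on thread {1}",
                    count++, Thread.CurrentThread.ManagedThreadId);
                yield return item;
            }
        }
    }

Passing ChunkTrace.Traced(CreateData()) to Parallel.ForEach then makes the 1, 2, 4, 8... pattern visible in the output on a partitioner that chunks this way.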

Presumably, the original reason for this behavior is that it wants to avoid having each thread return multiple times to get data, but it seems to be based on the assumption that all the data being processed can fit in memory (which is not the case when reading from large files).

This problem can be avoided with a custom partitioner, as stated previously. One generic example of a partitioner that simply returns the data to each task one item at a time is here:

https://gist.github.com/evolvedmicrobe/7997971

Simply create an instance of that class and pass it to Parallel.ForEach instead of the enumerable.
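
For reference, here is a minimal sketch of such a single-item partitioner (an illustrative reimplementation, not the code from the linked gist; disposal of the shared enumerator is omitted for brevity):

    using System.Collections;
    using System.Collections.Generic;
    using System.Collections.Concurrent;

    public class SingleItemPartitioner<T> : Partitioner<T>
    {
        private readonly IEnumerable<T> _source;

        public SingleItemPartitioner(IEnumerable<T> source)
        {
            _source = source;
        }

        public override bool SupportsDynamicPartitions
        {
            get { return true; }
        }

        public override IList<IEnumerator<T>> GetPartitions(int partitionCount)
        {
            IEnumerable<T> dynamicPartitions = GetDynamicPartitions();
            var partitions = new IEnumerator<T>[partitionCount];
            for (int i = 0; i < partitionCount; i++)
                partitions[i] = dynamicPartitions.GetEnumerator();
            return partitions;
        }

        public override IEnumerable<T> GetDynamicPartitions()
        {
            return new DynamicPartitions(_source.GetEnumerator());
        }

        private class DynamicPartitions : IEnumerable<T>
        {
            private readonly IEnumerator<T> _shared;
            private readonly object _gate = new object();

            public DynamicPartitions(IEnumerator<T> shared)
            {
                _shared = shared;
            }

            public IEnumerator<T> GetEnumerator()
            {
                while (true)
                {
                    T item;
                    lock (_gate) // serialize access to the shared source enumerator
                    {
                        if (!_shared.MoveNext())
                            yield break;
                        item = _shared.Current;
                    }
                    yield return item; // hand out exactly one item at a time
                }
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }
        }
    }

Usage would then look like Parallel.ForEach(new SingleItemPartitioner<byte[]>(CreateData()), (data) => { data[0] = 1; }).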


While using a custom partitioner is undoubtedly the most "correct" answer, a simpler solution is to let the garbage collector catch up. In the case I tried, I was making repeated calls to a Parallel.ForEach loop inside a function. Despite exiting the function each time, the memory used by the program kept increasing linearly, as described here. I added:

    // Force garbage collection.
    GC.Collect();
    // Wait for all finalizers to complete before continuing.
    GC.WaitForPendingFinalizers();

and, while it is not super fast, it did solve the memory problem. Presumably, under high CPU usage and memory pressure, the garbage collector becomes inefficient.
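
A sketch of the pattern being described (GetBatches and ProcessItem are hypothetical placeholders for the caller's own batching and per-item work):

    // GetBatches() and ProcessItem() are hypothetical stand-ins for the
    // caller's own batching and per-item work.
    foreach (var batch in GetBatches())
    {
        Parallel.ForEach(batch, item => ProcessItem(item));

        // Force a full collection between batches, then let finalizers
        // finish before starting the next batch.
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }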



