Multiprocessing pool in Python - only one processor is used

Original question

I am trying to use a multiprocessing Pool in Python. This is my code:

    import multiprocessing

    def f(x):
        return x

    def foo():
        p = multiprocessing.Pool()
        mapper = p.imap_unordered
        for x in xrange(1, 11):
            res = list(mapper(f, bar(x)))

All processors are used by this code (I have 8 processors) when the range is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 processor runs at 100% and the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shuts down CPUs due to overheating?

How can I solve this problem?

Minimal, complete, verifiable example

To reproduce my problem, I created this example: it is a simple n-gram generation from a string.

    #!/usr/bin/python
    import time
    import itertools
    import threading
    import multiprocessing
    import random

    def f(x):
        return x

    def ngrams(input_tmp, n):
        input = input_tmp.split()
        if n > len(input):
            n = len(input)
        output = []
        for i in range(len(input) - n + 1):
            output.append(input[i:i + n])
        return output

    def foo():
        p = multiprocessing.Pool()
        mapper = p.imap_unordered
        num = 100000000  # 100
        rand_list = random.sample(xrange(100000000), num)
        rand_str = ' '.join(str(i) for i in rand_list)
        for n in xrange(1, 100):
            res = list(mapper(f, ngrams(rand_str, n)))

    if __name__ == '__main__':
        start = time.time()
        foo()
        print 'Total time taken: ' + str(time.time() - start)

When num is small (e.g. num = 10000), I find that all 8 CPUs are used. However, when num is substantially larger (e.g. num = 100000000), only 2 processors are used and the rest are idling. That is my problem.

Caution: if the num value is too large, this may cause your system / VM to crash.

+7
python, multiprocessing, pool
2 answers

Firstly, ngrams itself takes a lot of time, and while that is happening it obviously only uses one core. But even when it finishes (which is very easy to verify by simply moving the ngrams call outside of mapper and throwing a print in before and after it), you are still only using one core. I get 1 core at 100% and the rest of the cores at around 2%.

If you try the same thing in Python 3.4, things are a little different: I still get 1 core at 100%, but the others are at 15-25%.

So what is going on? Well, in multiprocessing there is always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.

Here is how the pool works: the main process pickles the values, puts them on a queue, then waits for values on another queue and unpickles them. Each child process waits on the first queue, unpickles the values, does your do-nothing work, pickles the results, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most platforms other than Windows, and, I think, an NT kernel mutex on Windows).
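The flow described above is easier to see in a stripped-down, hand-rolled version of the same idea. This is only a rough sketch, not the real Pool implementation (tiny_pool_map and _worker are names made up for illustration), but it shows where the pickling, queueing and unpickling happen:

    import multiprocessing

    def _worker(inqueue, outqueue):
        while True:
            task = inqueue.get()        # blocks until a task arrives; unpickled here
            if task is None:            # sentinel: no more work for this worker
                break
            func, arg = task
            outqueue.put(func(arg))     # the result is pickled and sent back

    def tiny_pool_map(func, iterable, nworkers=4):
        inqueue = multiprocessing.Queue()
        outqueue = multiprocessing.Queue()
        procs = [multiprocessing.Process(target=_worker, args=(inqueue, outqueue))
                 for _ in range(nworkers)]
        for p in procs:
            p.start()
        items = list(iterable)
        for item in items:
            inqueue.put((func, item))   # the parent pays the pickling cost here
        for _ in procs:
            inqueue.put(None)
        results = [outqueue.get() for _ in items]   # unpickled back in the parent
        for p in procs:
            p.join()
        return results

    def square(x):
        return x * x

    if __name__ == '__main__':
        print tiny_pool_map(square, range(20))

Every single item makes this round trip, which is why tiny tasks like return x are dominated by the transport cost.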

From what I can tell, your processes spend more than 99% of their time waiting on the queue, or reading from and writing to it.

This is not too surprising, given that you have a large amount of data to shuttle around, and no computation at all beyond pickling and unpickling that data.

If you look at the source of SimpleQueue in CPython 2.7, the pickling and unpickling happen while the lock is held. So pretty much all of the work any of your background processes do happens with the lock held, which means they all end up serialized onto a single core.

But in CPython 3.4, the pickling and unpickling happen outside the lock. And apparently that is enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
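To make the difference concrete, here is a small self-contained illustration of the two behaviours described above. This is not the actual SimpleQueue source; the lock and the pipe are only stand-ins for the queue's internals:

    import os
    import pickle
    import multiprocessing

    queue_lock = multiprocessing.Lock()   # stand-in for the queue's write lock
    read_fd, write_fd = os.pipe()         # stand-in for the queue's pipe
    obj = range(1000)                     # the object being sent to a worker

    # 2.7-style: serialize while holding the lock, so senders take turns pickling
    with queue_lock:
        data = pickle.dumps(obj)          # expensive for huge objects
        os.write(write_fd, data)

    # 3.4-style: serialize first, and only hold the lock for the actual write
    data = pickle.dumps(obj)
    with queue_lock:
        os.write(write_fd, data)

    os.close(write_fd)
    os.close(read_fd)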

However, even on 3.4, you spend far more time waiting for access to the queue than doing anything else, even the multiprocessing overhead. That is why the cores only get up to 25%.

And, of course, you spend orders of magnitude more time on the overhead than on the actual work, which makes this a poor test, unless you are trying to measure the maximum throughput you can get out of a particular multiprocessing implementation on your machine, or something like that.

A few observations:

  • In your real code, if you can find a way to batch up larger tasks (explicitly; just relying on chunksize=1000 or the like won't help here), that will probably solve most of your problem.
  • If your giant array (or whatever it is) never changes, you can pass it in the pool initializer instead of in every task, which would pretty much eliminate the problem (see the sketch after this list).
  • If it does change, but only from the main process's side, it may be worth sharing the data rather than passing it.
  • If you need to mutate it from the child processes, see if there is a way to partition the data so that each task can own a slice without contention.
  • Even if you need fully contended shared memory with explicit locking, it may still be better than passing around something this huge.
  • It may be worth getting a backport of the 3.2+ version of multiprocessing, or one of the third-party multiprocessing libraries from PyPI (or switching to Python 3.x), just to move the pickling out of the lock.
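As a hedged sketch of the initializer suggestion above: the big string is sent to each worker once, when the pool starts, instead of being pickled into every task, and the heavy n-gram work moves into the children. The names init_worker and worker are made up for this example, and num is kept deliberately small:

    import random
    import multiprocessing

    _shared_str = None

    def init_worker(big_str):
        # runs once in each child process when the pool starts
        global _shared_str
        _shared_str = big_str

    def worker(n):
        # the heavy n-gram work now runs in the children, on the shared copy
        words = _shared_str.split()
        n = min(n, len(words))
        return [words[i:i + n] for i in range(len(words) - n + 1)]

    def main():
        num = 10000   # the question used 100000000; keep the sketch small
        rand_str = ' '.join(str(i) for i in random.sample(xrange(100000000), num))
        pool = multiprocessing.Pool(initializer=init_worker, initargs=(rand_str,))
        for grams in pool.imap_unordered(worker, xrange(1, 100)):
            pass      # each task is now a real chunk of work
        pool.close()
        pool.join()

    if __name__ == '__main__':
        main()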
+7

The problem is that your f() function (the one which runs in the separate processes) does no real work, so it does not keep the CPU busy.

ngrams(), on the other hand, does heavy computation, but you are calling that function in the main process, not in the pool.

To make things clearer, consider that this piece of code...

    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))

... is equivalent to this:

    for n in xrange(1, 100):
        arg = ngrams(rand_str, n)
        res = list(mapper(f, arg))

Also, this CPU-intensive work is performed in the main process:

    num = 100000000
    rand_list = random.sample(xrange(100000000), num)

You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive; then you will see a high load on all your CPUs.
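A minimal sketch of the second option (giving f() real CPU work so the pool has something to parallelize). The busy-work loop is invented purely for illustration:

    import multiprocessing

    def f(x):
        # artificial CPU-bound work, so each task keeps a core busy
        total = 0
        for i in xrange(200000):
            total += (x * i) % 7
        return total

    def foo():
        p = multiprocessing.Pool()
        results = list(p.imap_unordered(f, xrange(1, 1000)))
        p.close()
        p.join()
        return results

    if __name__ == '__main__':
        foo()

With this version, a CPU monitor should show all cores loaded, because each task now costs far more than the pickling round trip.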

+2
