First, `ngrams` itself takes a lot of time, and that, obviously, runs on only one core. But even after it finishes (which is easy to verify by simply moving the `ngrams` call outside the mapper and throwing a print before and after it), you're still only using one core: I get one core at 100% and the rest at about 2%.
If you try the same thing in Python 3.4, things are a little different: I still get one core at 100%, but the others are at 15-25%.
So what's going on? Well, multiprocessing always has some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just `return x`.
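If you want to see that for yourself, here's a minimal sketch (Python 3 syntax; the exact numbers will vary by machine). The `Pool` version does no more real work than the plain loop, so everything it loses is pure overhead:

```python
import time
from multiprocessing import Pool

def identity(x):
    # the entire "work" per task: no computation at all
    return x

if __name__ == '__main__':
    data = list(range(100000))

    t0 = time.time()
    with Pool(4) as pool:
        pool.map(identity, data)
    print('Pool.map:   %.2fs' % (time.time() - t0))

    t0 = time.time()
    [identity(x) for x in data]
    print('plain loop: %.2fs' % (time.time() - t0))
```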
Here's how pools work: the master process pickles the values, puts them on a queue, then waits for values on another queue and unpickles them. Each child process waits on the first queue, unpickles the values, does your do-nothing work, pickles the results, and puts them on the other queue. Access to the queues has to be synchronized (via POSIX semaphores on most non-Windows platforms; via NT kernel objects on Windows, I think).
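As a rough model of that round trip (a deliberately stripped-down sketch in Python 3, not the actual `Pool` implementation, using `None` as an ad hoc sentinel):

```python
import multiprocessing as mp

def worker(inq, outq):
    # the child unpickles each task from one queue, does the do-nothing
    # "work", and pickles the result back onto the other queue
    for task in iter(inq.get, None):
        outq.put(task)
    outq.put(None)

if __name__ == '__main__':
    inq, outq = mp.SimpleQueue(), mp.SimpleQueue()
    p = mp.Process(target=worker, args=(inq, outq))
    p.start()
    for x in range(5):
        inq.put(x)       # the master pickles each value onto the queue
    inq.put(None)        # sentinel so the worker exits
    print(list(iter(outq.get, None)))   # [0, 1, 2, 3, 4]
    p.join()
```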
From what I can tell, your processes spend over 99% of their time waiting on the queues, or reading from and writing to them.
This isn't too unexpected, given that you have a large amount of data to shuffle around and no computation at all beyond pickling and unpickling that data.
If you look at the source of SimpleQueue in CPython 2.7, the pickling and unpickling happen with the lock held. So pretty much all the work any of your background processes do happens while holding that lock, which means they all end up serialized onto a single core.
But in CPython 3.4, the pickling and unpickling happen outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
However, even on 3.4, you spend far more time waiting for access to the queue than doing anything else, even the multiprocessing overhead. That's why the cores only get up to 25%.
And, of course, you're spending orders of magnitude more time on overhead than on actual work, which makes this a poor test case, unless you're trying to measure the maximum throughput you can get out of a particular multiprocessing implementation on your machine, or something like that.
A few observations:
- In your real code, if you can find a way to batch the work into larger tasks (explicitly; just relying on `chunksize=1000` or the like won't help here), that will probably solve most of your problem. (See the first sketch after this list.)
- If your giant array (or whatever it is) never changes, you can pass it in the pool initializer instead of in every task, which pretty much eliminates the problem. (See the second sketch below.)
- If it does change, but only from the main-process side, it may be worth sharing the data rather than passing it.
- If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
- Even if you need fully shared memory with explicit locking, it may still be better than passing around something this huge. (See the third sketch below.)
- It may be worth grabbing a backport of the 3.2+ version of `multiprocessing`, or one of the third-party multiprocessing libraries from PyPI (or switching to Python 3.x), just to move the pickling out of the lock.
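First, a minimal sketch of explicit batching; the helper names here are made up for illustration. The point is that each task carries a whole slice of the input, so the queue traffic and pickling overhead are paid once per batch instead of once per item:

```python
from multiprocessing import Pool

def process_batch(batch):
    # do the real per-item work here; identity stands in for it
    return [x for x in batch]

def batches(seq, size):
    # yield consecutive slices of seq, each of length size (the last may be shorter)
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

if __name__ == '__main__':
    data = list(range(1000000))
    with Pool() as pool:
        results = []
        for chunk in pool.map(process_batch, batches(data, 10000)):
            results.extend(chunk)
```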
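Second, a sketch of the initializer trick, assuming the big object is read-only; `init_worker`, `_big`, and `lookup` are hypothetical names. The array is delivered once per worker process (or simply inherited on fork platforms) instead of being pickled into every single task:

```python
from multiprocessing import Pool

_big = None  # set once per worker by init_worker

def init_worker(big):
    global _big
    _big = big          # received once per worker, not once per task

def lookup(i):
    return _big[i]      # each task now only carries a small index

if __name__ == '__main__':
    big_array = list(range(10000000))
    with Pool(initializer=init_worker, initargs=(big_array,)) as pool:
        print(pool.map(lookup, [0, 42, 999]))   # [0, 42, 999]
```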
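And finally, a minimal sketch of shared memory with explicit locking, using `multiprocessing.Array` (which bundles a lock with a shared ctypes buffer). Whether this actually beats passing the data around depends on how contended the lock is:

```python
from multiprocessing import Array, Process

def bump(arr):
    # get_lock() returns the lock guarding the shared buffer
    with arr.get_lock():
        for i in range(len(arr)):
            arr[i] += 1

if __name__ == '__main__':
    shared = Array('d', [0.0] * 8)   # 8 shared doubles, lock included by default
    procs = [Process(target=bump, args=(shared,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(list(shared))              # each slot was incremented once per process
```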