Python multiprocessing: processes sleep when processing big data

I am using Python 2.7.10. I read a lot of files, stored them in a large list, and then tried to call multiprocessing and pass the large list to those processes, so that each process can access it and do some calculations.

I use Pool as follows:

    def read_match_wrapper(args):
        args2 = args[0] + (args[1],)
        read_match(*args2)

    pool = multiprocessing.Pool(processes=10)
    result = pool.map(read_match_wrapper,
                      itertools.izip(itertools.repeat((ped_list, chr_map, combined_id_to_id, chr)),
                                     range(10)))
    pool.close()
    pool.join()

Basically, I pass a few variables to the read_match function. To be able to use pool.map, I wrote the read_match_wrapper function. I do not need any results back from these processes; I just want them to run and finish.

I can get this whole thing to work when the ped_list data is fairly small. When I load all the data, about 10 GB, all the worker processes it spawns show state "S" (sleeping) and do not seem to do any work at all.

I do not know whether there is a limit on how much data you can pass through the pool. I really need help with this! Thanks!

+5
2 answers

From the multiprocessing programming guidelines:

Avoid shared state

 As far as possible one should try to avoid shifting large amounts of data between processes. 

What you are suffering from is a typical symptom of a full pipe which does not get drained.

The Python multiprocessing.Pipe used by the Pool has some design flaws. It basically implements a sort of message-oriented protocol on top of an OS pipe, which behaves more like a stream object.

As a result, if you send a too-large object through the pipe, it gets stuffed. The sender cannot add more content to it, and the receiver cannot drain it, because it is blocked waiting for the end of the message.

The evidence is that your workers are sleeping, waiting for that "fat" message that never arrives.
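As a rough way to see how much data each task ships to a worker, you can pickle one task's arguments yourself. This is a minimal sketch, assuming the pickled payload approximates what pool.map actually pushes through the pipe, and reusing the variable names from the question:

    import pickle

    # One element of the iterable passed to pool.map in the question:
    # the full ped_list travels with every single task.
    payload = ((ped_list, chr_map, combined_id_to_id, chr), 0)
    print(len(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)))

With 10 tasks, a 10 GB ped_list would be serialized and sent ten times over.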

Does ped_list contain the file names or the file contents?

If it is the latter, you should rather send the file names instead of the contents. The workers can then retrieve the contents themselves with a simple open().
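A minimal sketch of that idea, with a hypothetical process_one_file helper standing in for read_match: only the short path strings go through the pool, and each worker opens its own file:

    import multiprocessing

    def process_one_file(path):
        # Each worker opens and reads its own file, so only the path
        # string travels through the Pool's pipe.
        with open(path) as handle:
            data = handle.read()
        # ... run the read_match-style calculations on `data` here ...

    if __name__ == '__main__':
        file_names = ['chr1.ped', 'chr2.ped']  # hypothetical input paths
        pool = multiprocessing.Pool(processes=10)
        pool.map(process_one_file, file_names)
        pool.close()
        pool.join()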

+3

Instead of working with pool.map, I would rather use queues. You can spawn the required number of processes and give them a queue as input:

    n = 10  # number of processes
    tasks = multiprocessing.Queue()

    # spawn processes; args must be a tuple, and each process must be started
    processes = [multiprocessing.Process(target=read_match_wrapper, args=(tasks,))
                 for i in range(n)]
    for p in processes:
        p.start()

    for element in ped_list:
        tasks.put(element)

In this way, the queue is filled from one side and, at the same time, emptied from the other. You may need to put something into the queue before the processes start; otherwise there is a chance that they finish without doing anything because the queue is empty, or that they raise a Queue.Empty exception.
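A slightly fuller sketch of the same layout, with a hypothetical queue_worker in place of read_match_wrapper and ped_list taken from the question. It pre-fills the queue and uses one None sentinel per process, so the workers stop cleanly instead of racing an empty queue:

    import multiprocessing

    def queue_worker(tasks):
        while True:
            element = tasks.get()      # blocks until an item is available
            if element is None:        # sentinel: no more work to do
                break
            # ... process `element` (one entry of ped_list) here ...

    if __name__ == '__main__':
        n = 10
        tasks = multiprocessing.Queue()
        for element in ped_list:       # fill the queue before starting workers
            tasks.put(element)
        for _ in range(n):
            tasks.put(None)            # one sentinel per worker
        workers = [multiprocessing.Process(target=queue_worker, args=(tasks,))
                   for _ in range(n)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()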

0
