Multi-core batch processing

I want to batch process files on multiple cores. I have the following script:

  • I have 20 files.
  • I have a function that takes a file name, processes it, and produces an integer result. I want to apply this function to all 20 files, compute the integer output for each of them, sum the individual outputs, and print the final result.
  • Since I have 4 cores, I can only process 4 files at a time. So I want to run 5 rounds of 4 files each (4 * 5 = 20).
  • I want to create 4 processes, each of which processes 5 files one after another (the 1st process handles files 0, 4, 8, 12, 16; the 2nd process handles files 1, 5, 9, 13, 17; etc.).

How do I achieve this? I am confused by multiprocessing.Pool(), multiprocessing.Process(), and the other options.

Thanks.


To demonstrate Pool , I assume that your work function, which takes a file name and returns a number, is called work , and that the 20 files are named 1.txt , ..., 20.txt . One way to set this up would be:

    from multiprocessing import Pool

    pool = Pool(processes=4)
    result = pool.map_async(work, ("%d.txt" % n for n in range(1, 21)))
    print(sum(result.get()))

This method takes care of steps 3 and 4 for you: the pool keeps 4 worker processes busy until all 20 files are done, although the exact assignment of files to workers may differ from the strict round-robin you described.
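For a self-contained sketch of the same idea, here is a runnable version using the synchronous pool.map . The work function below is only a placeholder (it returns the length of the file name instead of reading a real file); substitute your actual processing:

```python
from multiprocessing import Pool

def work(filename):
    # Placeholder for the real processing: returns the length of the
    # file name instead of reading the file and computing a result.
    return len(filename)

if __name__ == '__main__':
    filenames = ["%d.txt" % n for n in range(1, 21)]
    with Pool(processes=4) as pool:
        # map blocks until all 20 results are in, using 4 workers
        total = sum(pool.map(work, filenames))
    print(total)
```

pool.map is often simpler than map_async when you only need the final result: it blocks until everything is done, so there is no separate .get() step.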


It is pretty simple.

    from multiprocessing import Pool

    def process_file(filename):
        return filename

    if __name__ == '__main__':
        pool = Pool()
        files = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        results = pool.imap(process_file, files)
        for result in results:
            print(result)

Pool automatically defaults to the number of processor cores you have. Also, make sure your processing function is importable (defined at the top level of a module) and that your multiprocessing code is guarded by if __name__ == '__main__': . Otherwise each worker will re-execute the spawning code, forking processes endlessly and locking up your computer.
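To connect this back to the question, the same pattern can sum the per-file results in one line. process_file below is again a placeholder (it squares its input rather than processing a real file):

```python
from multiprocessing import Pool

def process_file(n):
    # Placeholder: square the input instead of reading a real file.
    return n * n

if __name__ == '__main__':
    files = list(range(10))
    with Pool() as pool:  # worker count defaults to the number of cores
        # imap yields results lazily in input order; sum() consumes them
        total = sum(pool.imap(process_file, files))
    print(total)
```

imap is handy when you want to start consuming results before the whole input list has been processed; for a plain sum, pool.map would work just as well.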


Although Jared's answer is wonderful, I would personally use ProcessPoolExecutor from concurrent.futures and not bother with multiprocessing directly:

    from concurrent.futures import ProcessPoolExecutor

    with ProcessPoolExecutor(max_workers=4) as executor:
        result = sum(executor.map(process_file, files))

When things get a little trickier, Future objects and futures.as_completed can be genuinely nicer to work with than the multiprocessing equivalents. When things get much more complex, multiprocessing is far more flexible and powerful. But when the task is trivial, like this one, it is honestly hard to tell the difference.
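For instance, as_completed lets you consume results as soon as each worker finishes, regardless of submission order. A minimal sketch, where process_file is again a placeholder for the real work:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_file(n):
    # Placeholder: double the input instead of processing a real file.
    return n * 2

if __name__ == '__main__':
    files = [1, 2, 3, 4, 5]
    with ProcessPoolExecutor(max_workers=4) as executor:
        # submit() returns a Future per task; as_completed yields each
        # Future as soon as its result is ready, in completion order
        futures = [executor.submit(process_file, f) for f in files]
        total = sum(f.result() for f in as_completed(futures))
    print(total)
```

For a plain sum the completion order does not matter, but as_completed shines when you want to report progress or act on each result immediately.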

