Python multiprocessing across Amazon cloud instances

I am looking to run a lengthy Python parsing job on multiple Amazon EC2 instances. The code already uses the Python multiprocessing module and can take advantage of all the cores on a single machine.

The analysis is completely parallel, and the instances do not need to communicate with each other. All the work is file-based and each process operates on a separate file, so my plan was simply to mount the same S3 bucket on all nodes.
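For reference, the single-machine setup described above could look roughly like this; this is only a sketch, and parse_file and the file listing are placeholders rather than the actual code:

    from multiprocessing import Pool
    import glob

    def parse_file(path):
        # placeholder for the per-file parsing work
        pass

    if __name__ == "__main__":
        files = glob.glob("/mnt/s3/*.txt")   # files from the shared S3 mount
        pool = Pool()                        # defaults to one worker per CPU core
        pool.map(parse_file, files)
        pool.close()
        pool.join()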

I was wondering if anyone knew of any tutorials (or had any suggestions) for setting up such a multiprocessing environment, so that I could run the job on an arbitrary number of compute instances at once.

3 answers

The multiprocessing docs give you a good starting point for doing multiprocessing across multiple machines. Using S3 is a good way to share files between EC2 instances, but with multiprocessing you can also share queues and pass data between machines directly.
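For example, the remote-manager approach from the multiprocessing docs lets one instance serve a queue of file names that the other instances consume. This is only a sketch under assumptions: the hostname coordinator-host, the port, the authkey, and process_file are all placeholders, not values from the question.

    # Coordinator: run on one instance; serves a queue of file names over TCP.
    from multiprocessing.managers import BaseManager
    from queue import Queue                  # the module is called "Queue" on Python 2

    job_queue = Queue()
    for name in ["a.txt", "b.txt"]:          # placeholder: enqueue the real file list
        job_queue.put(name)

    class QueueManager(BaseManager):
        pass

    QueueManager.register("get_jobs", callable=lambda: job_queue)
    manager = QueueManager(address=("", 50000), authkey=b"change-me")
    manager.get_server().serve_forever()

    # Worker: run on every instance; pulls file names and parses them locally.
    import queue
    from multiprocessing.managers import BaseManager

    class QueueManager(BaseManager):
        pass

    QueueManager.register("get_jobs")
    manager = QueueManager(address=("coordinator-host", 50000), authkey=b"change-me")
    manager.connect()
    jobs = manager.get_jobs()

    while True:
        try:
            filename = jobs.get(timeout=10)   # give up once the queue stays empty
        except queue.Empty:
            break
        process_file(filename)                # placeholder for the per-file parsing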

If you can express your job as a Hadoop task, that is a very good way to get parallelism across machines, but if you need a lot of IPC, then building your own solution with multiprocessing is not so bad.

Just make sure you put your machines in the same security group :-)


I would use dumbo. It is a Python wrapper for Hadoop that is compatible with Amazon Elastic MapReduce. Write a small wrapper around your code to integrate it with dumbo; note that you will probably only need a map-only job, with no reduce step.
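As a rough illustration of what that wrapper might look like (assuming I recall dumbo's API correctly that a reducer can be omitted for a map-only job; parse_line is a placeholder for the existing parsing code):

    def mapper(key, value):
        # key is the byte offset, value is one line of the input file
        yield key, parse_line(value)      # placeholder for the actual parsing

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper)                 # no reducer: a map-only job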


I have recently been digging into IPython, and it looks like it supports parallel processing across multiple hosts right out of the box:

http://ipython.org/ipython-doc/stable/html/parallel/index.html
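A minimal sketch of how that could be used here, assuming an IPython cluster has already been started (e.g. with ipcluster) and that process_file and filenames are placeholders for the poster's own function and file list:

    from IPython.parallel import Client

    rc = Client()                       # connect to the running controller
    view = rc.load_balanced_view()      # distribute tasks across all engines
    results = view.map_sync(process_file, filenames)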

