Python multiprocessing across Amazon cloud instances

I am looking to run a lengthy Python parsing job on multiple Amazon EC2 instances. The code already uses the Python multiprocessing module and can take advantage of all the cores on a single machine.

The analysis is completely parallel, and the instances do not need to communicate with each other. All the work is file-based and each process operates on a separate file, so my plan was simply to mount the same S3 bucket on all nodes.
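For reference, the single-machine setup described above could look roughly like this; this is only a sketch, and parse_file and the file listing are placeholders rather than the actual code:

    from multiprocessing import Pool
    import glob

    def parse_file(path):
        # placeholder for the per-file parsing work
        pass

    if __name__ == "__main__":
        files = glob.glob("/mnt/s3/*.txt")   # files from the shared S3 mount
        pool = Pool()                        # defaults to one worker per CPU core
        pool.map(parse_file, files)
        pool.close()
        pool.join()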

I was wondering if anyone knew of any tutorials (or had any suggestions) for setting up such a multiprocessing environment, so that I could run the job on an arbitrary number of compute instances at once.

3 answers

The multiprocessing docs give you a good starting point for doing multiprocessing across multiple machines. Using S3 is a good way to share files between EC2 instances, but with multiprocessing you can also share queues and pass data between machines directly.
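For example, the remote-manager approach from the multiprocessing docs lets one instance serve a queue of file names that the other instances consume. This is only a sketch under assumptions: the hostname coordinator-host, the port, the authkey, and process_file are all placeholders, not values from the question.

    # Coordinator: run on one instance; serves a queue of file names over TCP.
    from multiprocessing.managers import BaseManager
    from queue import Queue                  # the module is called "Queue" on Python 2

    job_queue = Queue()
    for name in ["a.txt", "b.txt"]:          # placeholder: enqueue the real file list
        job_queue.put(name)

    class QueueManager(BaseManager):
        pass

    QueueManager.register("get_jobs", callable=lambda: job_queue)
    manager = QueueManager(address=("", 50000), authkey=b"change-me")
    manager.get_server().serve_forever()

    # Worker: run on every instance; pulls file names and parses them locally.
    import queue
    from multiprocessing.managers import BaseManager

    class QueueManager(BaseManager):
        pass

    QueueManager.register("get_jobs")
    manager = QueueManager(address=("coordinator-host", 50000), authkey=b"change-me")
    manager.connect()
    jobs = manager.get_jobs()

    while True:
        try:
            filename = jobs.get(timeout=10)   # give up once the queue stays empty
        except queue.Empty:
            break
        process_file(filename)                # placeholder for the per-file parsing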

If you can express your job as a Hadoop task, that is a very good way to get parallelism across machines, but if you need a lot of IPC, then building your own solution with multiprocessing is not so bad.

Just make sure you put your machines in the same security group :-)


I would use dumbo. It is a Python wrapper for Hadoop that is compatible with Amazon Elastic MapReduce. Write a small wrapper around your code to integrate it with dumbo; note that you will probably only need a map-only job, with no reduce step.
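As a rough illustration of what that wrapper might look like (assuming I recall dumbo's API correctly that a reducer can be omitted for a map-only job; parse_line is a placeholder for the existing parsing code):

    def mapper(key, value):
        # key is the byte offset, value is one line of the input file
        yield key, parse_line(value)      # placeholder for the actual parsing

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper)                 # no reducer: a map-only job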


I have recently been digging into IPython, and it looks like it supports parallel processing across multiple hosts right out of the box:

http://ipython.org/ipython-doc/stable/html/parallel/index.html
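A minimal sketch of how that could be used here, assuming an IPython cluster has already been started (e.g. with ipcluster) and that process_file and filenames are placeholders for the poster's own function and file list:

    from IPython.parallel import Client

    rc = Client()                       # connect to the running controller
    view = rc.load_balanced_view()      # distribute tasks across all engines
    results = view.map_sync(process_file, filenames)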

