IPython notebook: how to parallelize an external script

I am trying to use parallel computing from the IPython parallel library. But I know little about this, and I find the documentation difficult to read for someone who knows nothing about parallel computing.

It is frustrating that all the training materials I have found simply reuse the example from the documentation, with the same explanation, which from my point of view is useless.

Basically, what I would like to do is run several instances of a script in the background so that they run at the same time. In bash, it would be something like:

    for my_file in $(cat list_file); do python pgm.py $my_file & done

But the IPython notebook's bash interpreter does not handle background mode.

It seems that the solution is to use the parallel library from IPython.

I tried:

    from IPython.parallel import Client
    rc = Client()
    rc.block = True
    dview = rc[:2]  # I take only 2 engines

But then I got stuck. I do not know how to run the same script or pgm twice (or more) at the same time.

Thanks.

2 answers

A year later, I managed to get what I wanted.

1) Create a function that does what you want to run on the different processors. Here it simply calls the script from bash with IPython's ! magic. I assume it would also work with the subprocess call() function (a sketch follows below).

    def my_func(my_file):
        !python pgm.py {my_file}

Do not forget the {} around the variable when using !

Please note that the path to my_file must be absolute, since the engines run in the directory where you started the notebook (when running jupyter notebook or ipython notebook), which is not necessarily where you are working.
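For reference, here is a minimal sketch of the same function written with subprocess instead of the ! magic; pgm.py and the file-path argument come from the question, and I have not tested this variant:

    import subprocess

    def my_func(my_file):
        # same call as the ! version, via the standard library
        subprocess.call(['python', 'pgm.py', my_file])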

2) Start your IPython notebook cluster with the right number of CPUs. Wait a couple of seconds, then execute the following cell:

    from IPython import parallel
    rc = parallel.Client()
    view = rc.load_balanced_view()
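Before mapping anything, it can be worth a quick sanity check that the engines actually registered; rc.ids lists the ids of the connected engines:

    # should print the number of CPUs you started in the cluster
    print(len(rc.ids))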

3) Get a list of files that you want to process:

    files = list_of_files
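For example, the list could be built with glob; the pattern below is a hypothetical placeholder, and remember the absolute-path caveat from step 1:

    import glob

    # hypothetical pattern; use absolute paths (see the note in step 1)
    files = sorted(glob.glob('/absolute/path/to/data/*.txt'))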

4) Asynchronously map your function onto all your files with the view of the engines you created (not sure of the wording).

    r = view.map_async(my_func, files)

While it runs, you can do something else in the notebook (it runs in the background!). You can also call r.wait_interactive(), which interactively reports the number of files processed, the time spent so far, and the number of remaining files. This will prevent other cells from running (but you can interrupt it).

And if you have more files than engines, do not worry: they will be processed as soon as an engine finishes with one file.
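If you would rather poll than block on r.wait_interactive(), the AsyncResult returned by map_async also exposes ready(), progress and elapsed; a minimal sketch:

    # poll the result without blocking other cells
    if not r.ready():
        print('%d tasks done, %.1f s elapsed' % (r.progress, r.elapsed))
    else:
        results = r.get()  # retrieve the return values (here: None per file)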

Hope this helps others!

This tutorial can help:

http://nbviewer.ipython.org/github/minrk/IPython-parallel-tutorial/blob/master/Index.ipynb

Note that I am still on IPython 2.3.1; I don't know if this has changed with Jupyter.

Edit: this still works with Jupyter; see here for the differences and potential problems you might encounter.


Please note that if you use external libraries in your function, you need to import them on the different engines with:

    %px import numpy as np

or

    %%px
    import numpy as np
    import pandas as pd

The same goes for variables and other functions: you have to push them into the engines' namespace:

    rc[:].push(dict(foo=foo, bar=bar))
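Equivalently, a direct view supports dictionary-style assignment, which does the same thing as push():

    # dictionary-style assignment on a DirectView is equivalent to push
    rc[:]['foo'] = foo
    rc[:]['bar'] = bar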


If you are trying to execute some external scripts in parallel, you do not need the IPython parallel machinery. The parallel execution of the bash example can be replicated with the subprocess module as follows:

    import subprocess

    procs = []
    for i in range(10):
        procs.append(subprocess.Popen(['ls', '/Users/shad/tmp/'],
                                      stdout=subprocess.PIPE))

    results = []
    for proc in procs:
        stdout, _ = proc.communicate()
        results.append(stdout)

Be careful: if a subprocess generates a lot of output, it can block once the pipe buffer fills up, until its output is read. If you print the results, you get:

    print results
    ['file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n',
     'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n',
     'file1\nfile2\n', 'file1\nfile2\n']
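To mirror the exact bash loop from the question (one python pgm.py per line of list_file), the same pattern applies; a sketch, assuming list_file contains one path per line:

    import subprocess

    # read the file list, one path per line
    # (mirrors: for my_file in $(cat list_file))
    with open('list_file') as fh:
        paths = [line.strip() for line in fh if line.strip()]

    # launch one process per file; they all run concurrently
    procs = [subprocess.Popen(['python', 'pgm.py', p]) for p in paths]

    # wait for every process to finish
    for proc in procs:
        proc.wait()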
