How to parallelize file downloads?

I can download the files one at a time:

    import urllib.request

    urls = ['foo.com/bar.gz', 'foobar.com/barfoo.gz', 'bar.com/foo.gz']

    for u in urls:
        urllib.request.urlretrieve(u)

I could try subprocess as such:

    import subprocess
    import os

    def parallelized_commandline(command, files, max_processes=2):
        processes = set()
        for name in files:
            processes.add(subprocess.Popen([command, name]))
            if len(processes) >= max_processes:
                os.wait()
                processes.difference_update(
                    [p for p in processes if p.poll() is not None])

        # Check if all the child processes were closed
        for p in processes:
            if p.poll() is None:
                p.wait()

    urls = ['http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.en.gz',
            'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.cs.gz',
            'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.de.gz']

    parallelized_commandline('wget', urls)

Is there a way to parallelize urlretrieve without using os.system or subprocess to cheat?

Given that I have to resort to "cheating" for now, is subprocess.Popen the right way to download the data?

When using parallelized_commandline() above, the wget downloads are multi-threaded but not multi-core. Is this normal? Is there a way to make it multi-core rather than multi-threaded?

1 answer

You can use a thread pool to download the files in parallel:

    #!/usr/bin/env python3
    from multiprocessing.dummy import Pool  # use threads for I/O bound tasks
    from urllib.request import urlretrieve

    urls = [...]
    result = Pool(4).map(urlretrieve, urls)  # download 4 files at a time

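The Pool.map call above saves each download to a temporary file (urlretrieve's default when no filename is given). If you want to control the output names, urlretrieve also accepts a target path as its second argument; here is a minimal sketch of that variant, assuming each file should simply be saved under the basename of its URL path (that naming choice is mine, not part of the answer above):

    #!/usr/bin/env python3
    import os.path
    from multiprocessing.dummy import Pool  # thread pool, as above
    from urllib.parse import urlsplit
    from urllib.request import urlretrieve

    urls = [...]

    def output_name(url):
        # assumption: save each file under the last component of its URL path
        return os.path.basename(urlsplit(url).path)

    # download 4 files at a time, passing (url, filename) pairs to urlretrieve
    with Pool(4) as pool:
        result = pool.starmap(urlretrieve, ((u, output_name(u)) for u in urls))
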
You can also download multiple files concurrently in a single thread using asyncio:

    #!/usr/bin/env python3
    import asyncio
    import logging
    from contextlib import closing
    import aiohttp  # $ pip install aiohttp

    @asyncio.coroutine
    def download(url, session, semaphore, chunk_size=1<<15):
        with (yield from semaphore):  # limit number of concurrent downloads
            filename = url2filename(url)
            logging.info('downloading %s', filename)
            response = yield from session.get(url)
            with closing(response), open(filename, 'wb') as file:
                while True:  # save file
                    chunk = yield from response.content.read(chunk_size)
                    if not chunk:
                        break
                    file.write(chunk)
            logging.info('done %s', filename)
        return filename, (response.status, tuple(response.headers.items()))

    urls = [...]
    logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
    with closing(asyncio.get_event_loop()) as loop, \
         closing(aiohttp.ClientSession()) as session:
        semaphore = asyncio.Semaphore(4)
        download_tasks = (download(url, session, semaphore) for url in urls)
        result = loop.run_until_complete(asyncio.gather(*download_tasks))

where url2filename() is defined here.
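The linked definition is not reproduced here; a minimal sketch of what url2filename() is assumed to do (turn a URL into a local file name by taking the last component of its path) could be:

    import os.path
    from urllib.parse import urlsplit

    def url2filename(url):
        # assumption: mirrors the helper referenced above; the real definition
        # lives in the linked answer
        return os.path.basename(urlsplit(url).path)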

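Note that the @asyncio.coroutine / yield from style above was written for older Python 3 releases and has since been deprecated and removed in newer versions; a sketch of the same idea in the modern async def / await syntax (same semaphore limit, same chunked writes, reusing the url2filename() helper) might look like this:

    #!/usr/bin/env python3
    import asyncio
    import aiohttp  # $ pip install aiohttp

    async def download(url, session, semaphore, chunk_size=1 << 15):
        async with semaphore:  # limit number of concurrent downloads
            filename = url2filename(url)
            async with session.get(url) as response:
                with open(filename, 'wb') as file:
                    while True:  # save file chunk by chunk
                        chunk = await response.content.read(chunk_size)
                        if not chunk:
                            break
                        file.write(chunk)
            return filename, response.status

    async def main(urls):
        semaphore = asyncio.Semaphore(4)
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                *(download(url, session, semaphore) for url in urls))

    urls = [...]
    result = asyncio.run(main(urls))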
