I am completely new to multiprocessing. I have read the documentation on the multiprocessing module, and I read about pools, threads, queues, etc., but I am completely lost.
What I want to do with multiprocessing is convert my humble HTTP downloader to work with multiple workers. What I am doing at the moment is: download a page, parse it to find the interesting links, and continue until all interesting links have been downloaded. Now I want to implement this with multiprocessing, but I have no idea how to organize the workflow. I had two thoughts about this.

First, I was thinking of two queues: one for links that need to be downloaded, and another for pages that need to be parsed. One worker downloads pages and puts them on the queue of items to be parsed; another worker parses a page and puts the links it finds interesting on the download queue. The problems I expect with this approach: first, why download only one page at a time and parse only one page at a time? And second, how does a worker know that more items will be added to its queue later, once it has exhausted all the items currently in its queue?
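To make that first idea concrete, here is a minimal sketch of the two-queue design. fetch_page, find_links, and the starting URL are placeholders for my real download and parsing code:

    import multiprocessing

    def fetch_page(url):
        return ""    # placeholder for my real HTTP download code

    def find_links(html):
        return []    # placeholder for my real link-extraction code

    def downloader(to_download, to_parse):
        # worker 1: pull URLs from one queue, push fetched pages onto the other
        while True:
            url = to_download.get()
            to_parse.put(fetch_page(url))

    def parser(to_download, to_parse):
        # worker 2: pull pages from one queue, push interesting links back
        while True:
            html = to_parse.get()
            for link in find_links(html):
                to_download.put(link)

    if __name__ == '__main__':
        to_download = multiprocessing.Queue()
        to_parse = multiprocessing.Queue()
        to_download.put('http://example.com')   # hypothetical starting URL
        multiprocessing.Process(target=downloader,
                                args=(to_download, to_parse)).start()
        multiprocessing.Process(target=parser,
                                args=(to_download, to_parse)).start()

As written this never finishes, which is exactly my second worry: once the queues are momentarily empty, both workers just block on get() with no way to know the crawl is done.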
The other approach I was thinking about is this: there is a function that is called with a URL as its argument. It downloads the document and starts scanning it for links, and every time it encounters an interesting link it immediately spawns a new process that runs this same function. The problems I see with this approach are: how do I keep track of all the processes spawned everywhere, how do I find out whether there are still more processes left to run, and how do I limit the maximum number of processes?
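The closest I could get to sketching this second idea uses multiprocessing.Pool instead of raw recursive spawning, since the pool at least caps the number of worker processes for me. Again, fetch_page, find_links, and the starting URL are placeholder assumptions, not real code:

    import multiprocessing

    def fetch_page(url):
        return ""    # placeholder for my real HTTP download code

    def find_links(html):
        return []    # placeholder for my real link-extraction code

    def fetch_and_parse(url):
        # runs in a worker process: download one page, return its links
        return find_links(fetch_page(url))

    if __name__ == '__main__':
        pool = multiprocessing.Pool(4)     # caps me at 4 worker processes
        seen = set()                       # URLs already submitted
        pending = []                       # AsyncResults for in-flight pages

        def submit(url):
            if url not in seen:
                seen.add(url)
                pending.append(pool.apply_async(fetch_and_parse, (url,)))

        submit('http://example.com')       # hypothetical starting URL
        while pending:                     # unfinished results mean more work
            for link in pending.pop(0).get():   # wait for one page to finish
                submit(link)
        pool.close()
        pool.join()

I am not sure this is the right way to do it, though; keeping the pending list in the parent process feels like bookkeeping the library should be doing for me.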
So, I am completely lost. Can anyone suggest a good strategy and maybe show some code examples of how to go about it?