Python - multiple simultaneous thread pools

I am writing a web scraper in Python using httplib2 and lxml (yes, I know I could use Scrapy; let's move past that). The scraper has about 15,000 pages to parse into roughly 400,000 items. The code that parses the items runs almost instantly, but the part that downloads each page from the server is still very slow. I would like to overcome that with concurrency. However, I cannot rely on EVERY page needing to be parsed EVERY time. I tried a single ThreadPool (like multiprocessing.pool, but done with threads, which should be fine since this is an I/O-bound process), but I could not think of a graceful (or working) way to get ALL of the threads to stop when the date of the last index item was greater than that of the item being processed. Right now I am working on an approach that uses two ThreadPool instances: one to download each page and another to parse the pages. Simplified code example:

#! /usr/bin/env python2

import httplib2
from Queue import PriorityQueue
from multiprocessing.pool import ThreadPool
from lxml.html import fromstring

pages = [x for x in range(1000)]
page_queue = PriorityQueue(1000)

url = "http://www.google.com"

def get_page(page):
    #Grabs google.com
    h = httplib2.Http(".cache")
    resp, content = h.request(url, "GET")
    tree = fromstring(str(content), base_url=url)
    page_queue.put((page, tree))
    print page_queue.qsize()

def parse_page():
    page_num, page = page_queue.get()
    print "Parsing page #" + str(page_num)
    #do more stuff with the page here
    page_queue.task_done()

if __name__ == "__main__":
    collect_pool = ThreadPool()
    collect_pool.map_async(get_page, pages)
    collect_pool.close()

    parse_pool = ThreadPool()
    parse_pool.apply_async(parse_page)
    parse_pool.close()

    parse_pool.join()
    collect_pool.join()
    page_queue.join()

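Running the code, however, does not do what I expect, which is to fire off two thread pools: one populating a queue, and another pulling from it to parse. It starts the collect pool and runs through it, and only then starts parse_pool (I assume; I have not let the code run long enough to reach parse_pool, the point being that collect_pool is all that seems to be running). I am fairly sure I have messed up the order of the calls to join(), but I cannot work out what order they are supposed to be in. My question is essentially this: am I on the right track here, and if so, what am I doing wrong? If not, what would you suggest?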

1 answer

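First of all, your design seems correct at a high level. Using a thread pool to collect the pages is justified by the synchronous nature of the httplib2 module. (With an asynchronous library, one thread would be enough; note that even with httplib2 and a pool, at most one collector thread executes Python code at any moment because of the GIL.) The parsing pool is justified by lxml being written in C/C++ (assuming that the Global Interpreter Lock is therefore released while a page is parsed; this should be checked in the lxml documentation or code!). If that were not the case, a dedicated parsing pool would bring no performance gain, because only one thread at a time could hold the GIL. In that case a process pool would be the better choice.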

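I am not familiar with the ThreadPool implementation, but I assume it is analogous to the Pool class in the multiprocessing module. On that basis, the problem seems to be that you create only a single work item for parse_pool, and after parse_page has processed the first page, it never tries to dequeue further pages. No new work items are submitted to this pool either, so the parse_pool threads finish right after the call to parse_pool.close().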
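A minimal sketch of that failure mode, written for Python 3 (the question's code is Python 2) and with integers standing in for downloaded pages. A single apply_async() call creates exactly one work item, so exactly one page leaves the queue no matter how many threads the pool has. The names `consume_one`, `consumed` and `remaining` are illustrative and do not appear in the original code.

```python
from multiprocessing.pool import ThreadPool
from queue import Queue

page_queue = Queue()
for page_num in range(5):          # pretend five pages were already downloaded
    page_queue.put(page_num)

consumed = []

def consume_one():
    # Mirrors parse_page() from the question: a single get(), no loop.
    consumed.append(page_queue.get())
    page_queue.task_done()

parse_pool = ThreadPool(4)
parse_pool.apply_async(consume_one)    # one work item, hence one page parsed
parse_pool.close()
parse_pool.join()

remaining = page_queue.qsize()         # pages that nobody will ever parse
```

In the question's script the final page_queue.join() would then block indefinitely, because task_done() is called only once while many pages have been put on the queue.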

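The solution is to get rid of page_queue. The get_page() function should put a work item on parse_pool by calling apply_async() for every page it collects, instead of feeding the pages into page_queue.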

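The main thread should wait until the collect pool has finished its work (i.e. the collect_pool.join() call has returned), and then close parse_pool (since we can be sure that no more work will be submitted to the parser). It should then wait for parse_pool to drain by calling parse_pool.join(), and exit.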
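The arrangement described above could look roughly like the following sketch. It is written for Python 3 and makes several assumptions: `run_pipeline`, `fake_fetch` and the pool sizes are hypothetical names and values, the fetch function stands in for the httplib2 request, and `len(content)` stands in for the lxml parsing step.

```python
import threading
from multiprocessing.pool import ThreadPool

def fake_fetch(page):
    # Hypothetical stand-in for the httplib2 request in get_page().
    return "<html>page %d</html>" % page

def run_pipeline(pages, fetch, collectors=8, parsers=4):
    parsed = []
    lock = threading.Lock()
    collect_pool = ThreadPool(collectors)
    parse_pool = ThreadPool(parsers)

    def parse_page(page, content):
        # Stand-in for the real parsing work on the downloaded page.
        with lock:
            parsed.append((page, len(content)))

    def get_page(page):
        # Hand each page straight to the parser pool; no shared queue.
        parse_pool.apply_async(parse_page, (page, fetch(page)))

    collect_pool.map_async(get_page, pages)
    collect_pool.close()
    collect_pool.join()    # every page is fetched and submitted for parsing

    parse_pool.close()     # safe now: no further submissions can arrive
    parse_pool.join()      # wait for the parsers to drain
    return sorted(parsed)

results = run_pipeline(range(20), fake_fetch)
```

Note that map_async() and apply_async() hold on to any exception raised in a worker until the returned result object is queried, so a real scraper would probably want to keep those result objects and call get() on them.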

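Furthermore, you need to increase the number of threads in collect_pool in order to process more HTTP requests concurrently. The default number of threads in a pool is the number of CPUs; currently, you cannot have more requests in flight than that. You can experiment with values up into the thousands or tens of thousands; watch the CPU consumption of the pool; it should not approach 1 CPU.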
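To see how the pool size caps the number of requests in flight, a small Python 3 experiment along these lines may help. It is only a sketch: `measure_peak_concurrency` is a hypothetical helper, and time.sleep() stands in for waiting on the server.

```python
import threading
import time
from multiprocessing.pool import ThreadPool

def measure_peak_concurrency(num_threads, num_tasks, delay=0.05):
    active = 0
    peak = 0
    lock = threading.Lock()

    def fake_request(_):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(delay)          # stand-in for waiting on the server
        with lock:
            active -= 1

    # ThreadPool() without an argument would default to the CPU count.
    pool = ThreadPool(num_threads)
    pool.map(fake_request, range(num_tasks))
    pool.close()
    pool.join()
    return peak

small = measure_peak_concurrency(1, 5, delay=0.01)
large = measure_peak_concurrency(16, 32, delay=0.1)
```

The peak can never exceed the number of threads passed to ThreadPool, which is why a pool left at its default size limits how many downloads overlap.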
