I'm writing a web scraper in Python using httplib2 and lxml (yes - I know I could be using Scrapy. Let's move past that...). The scraper has about 15,000 pages to parse into roughly 400,000 items. I've got the code that parses the items running (almost) instantaneously, but the part that downloads each page from the server is still extremely slow, and I'd like to overcome that with concurrency.

However, I can't rely on EVERY page needing to be parsed EVERY time. I tried a single ThreadPool (multiprocessing.pool's thread-backed version, which should be fine since this is an I/O-bound job), but I couldn't come up with a graceful (or working) way of getting ALL of the threads to stop once the date of the last index item was greater than the date of the item being processed.

Right now I'm working on a method that uses two ThreadPool instances - one to download each page and one to parse the pages. A simplified code example:
import httplib2
from Queue import PriorityQueue
from multiprocessing.pool import ThreadPool
from lxml.html import fromstring

pages = range(1000)              # simplified stand-in for the real page list
page_queue = PriorityQueue(1000)
url = "http://www.google.com"

def get_page(page):
    # download one page and hand the parsed tree to the queue
    h = httplib2.Http(".cache")
    resp, content = h.request(url, "GET")
    tree = fromstring(str(content), base_url=url)
    page_queue.put((page, tree))
    print page_queue.qsize()

def parse_page():
    # pull one page off the queue and "parse" it
    page_num, page = page_queue.get()
    print "Parsing page #" + str(page_num)
    page_queue.task_done()

if __name__ == "__main__":
    collect_pool = ThreadPool()
    collect_pool.map_async(get_page, pages)
    collect_pool.close()

    parse_pool = ThreadPool()
    parse_pool.apply_async(parse_page)
    parse_pool.close()

    parse_pool.join()
    collect_pool.join()
    page_queue.join()
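As an aside, the closest I got to a stop condition in the single-pool attempt was having every worker check a shared threading.Event, which never felt graceful because map_async has already queued all of the work by the time the flag gets set. A sketch of what I mean - last_index_date and item_date are made-up stand-ins for fields from my real data:

import datetime
import threading

stop_flag = threading.Event()

# Stand-ins for my real data; in the scraper these come from the parsed pages.
last_index_date = datetime.date(2011, 4, 15)

def item_date(page):
    # hypothetical: pretend later pages carry earlier dates
    return datetime.date(2011, 4, 15) - datetime.timedelta(days=page)

def get_page_checked(page):
    if stop_flag.is_set():
        return                       # another worker already hit the cutoff
    if last_index_date > item_date(page):
        stop_flag.set()              # past the cutoff: tell all workers to bail
        return
    # ...otherwise download and queue the page, exactly as get_page does above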
Coming back to the two-pool code above: it works, in the sense that pages get downloaded and pages get parsed. The trouble is that parse_pool only ever parses a single page - apply_async submits parse_page exactly once - so one thread parses one page while collect_pool downloads all thousand. It also feels wrong that I join parse_pool before collect_pool, and that nothing tells the parser when the downloads are actually finished (the queue can be empty simply because the downloaders haven't caught up yet).
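What I'm tempted to try next is a fixed number of parser threads that loop on the queue, with one None sentinel per parser so they all shut down once downloading is finished. A rough, untested sketch (NUM_PARSERS is a made-up knob, and I've swapped the PriorityQueue for a plain Queue here, since None sentinels don't order sensibly against (page, tree) tuples):

import httplib2
from Queue import Queue
from multiprocessing.pool import ThreadPool
from lxml.html import fromstring

NUM_PARSERS = 4                  # made-up; however many parser threads make sense
url = "http://www.google.com"
page_queue = Queue()

def get_page(page):
    h = httplib2.Http(".cache")
    resp, content = h.request(url, "GET")
    page_queue.put((page, fromstring(str(content), base_url=url)))

def parse_pages():
    while True:
        item = page_queue.get()
        if item is None:             # sentinel: downloading is over, exit the loop
            page_queue.task_done()
            return
        page_num, tree = item
        print "Parsing page #" + str(page_num)
        page_queue.task_done()

if __name__ == "__main__":
    parse_pool = ThreadPool(NUM_PARSERS)
    for _ in range(NUM_PARSERS):
        parse_pool.apply_async(parse_pages)   # each call is a long-running consumer

    collect_pool = ThreadPool()
    collect_pool.map(get_page, range(1000))   # blocks until every page is fetched
    collect_pool.close()
    collect_pool.join()

    for _ in range(NUM_PARSERS):
        page_queue.put(None)                  # one sentinel per parser thread

    page_queue.join()                         # every queued page has been parsed
    parse_pool.close()
    parse_pool.join()

The appeal is that page_queue.join() would then actually mean "everything downloaded has been parsed", and the parsers exit on their own rather than being abandoned mid-queue.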
My question: am I on the right track here? Or is there a better way of going about this? Thanks in advance.