Tornado: AsyncHTTPClient.fetch from an iterator?

I am trying to write a web crawler and want to make HTTP requests as quickly as possible. Tornado's AsyncHTTPClient seems like a good choice, but every code sample I have seen (e.g. https://stackoverflow.com/a/125/9203/... ) basically calls AsyncHTTPClient.fetch on a huge list of URLs, letting Tornado queue them internally and eventually make the requests.

But what if I want to process an infinitely long (or simply very large) list of URLs coming from a file or the network? I do not want to load all the URLs into memory.

Googling did not turn up a way to feed AsyncHTTPClient.fetch from an iterator. However, I did find a way to do what I want using gevent: http://gevent.org/gevent.threadpool.html#gevent.threadpool.ThreadPool.imap . Is there a way to do something similar in Tornado?
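
For concreteness, this is roughly the gevent pattern I mean (just a sketch to illustrate; the fetch helper and the urls.txt file are made-up placeholders, not my real code):

    from urllib.request import urlopen

    from gevent.threadpool import ThreadPool

    def fetch(url):
        # Runs in a pool thread, so plain blocking I/O is fine here.
        with urlopen(url) as resp:
            return url, resp.getcode()

    def urls():
        # Lazily yield URLs from a file without loading them all into memory.
        with open("urls.txt") as f:
            for line in f:
                yield line.strip()

    pool = ThreadPool(10)  # at most 10 fetches in flight at once
    for url, status in pool.imap(fetch, urls()):
        print(url, status)

That lets me feed URLs from a generator instead of a pre-built list, and I am looking for the equivalent with AsyncHTTPClient.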

One solution I have considered is to queue up only a limited number of URLs at a time and enqueue more whenever a fetch completes, but I am hoping there is a cleaner solution.

Any help or recommendations would be appreciated!

1 answer

I would do this with a queue and several workers, in a variation of https://github.com/tornadoweb/tornado/blob/master/demos/webspider/webspider.py :

    import tornado.queues
    from tornado import gen
    from tornado.httpclient import AsyncHTTPClient
    from tornado.ioloop import IOLoop

    # Bound both the work queue and the number of concurrent requests.
    NUM_WORKERS = 10
    QUEUE_SIZE = 100
    q = tornado.queues.Queue(QUEUE_SIZE)
    AsyncHTTPClient.configure(None, max_clients=NUM_WORKERS)
    http_client = AsyncHTTPClient()

    @gen.coroutine
    def worker():
        # Each worker pulls one URL at a time from the queue and fetches it.
        while True:
            url = yield q.get()
            try:
                response = yield http_client.fetch(url)
                print('got response from', url)
            except Exception:
                print('failed to fetch', url)
            finally:
                q.task_done()

    @gen.coroutine
    def main():
        for i in range(NUM_WORKERS):
            IOLoop.current().spawn_callback(worker)
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                # When the queue fills up, stop here to wait instead
                # of reading more from the file.
                yield q.put(url)
        yield q.join()

    if __name__ == '__main__':
        IOLoop.current().run_sync(main)
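
If you are on a newer Tornado (5.0+, Python 3.5+), the same pattern can be written with native coroutines. This is only a sketch of the same logic rewritten with async/await; nothing else changes:

    import tornado.queues
    from tornado.httpclient import AsyncHTTPClient
    from tornado.ioloop import IOLoop

    NUM_WORKERS = 10
    q = tornado.queues.Queue(100)
    AsyncHTTPClient.configure(None, max_clients=NUM_WORKERS)
    http_client = AsyncHTTPClient()

    async def worker():
        while True:
            url = await q.get()
            try:
                await http_client.fetch(url)
                print('got response from', url)
            except Exception:
                print('failed to fetch', url)
            finally:
                q.task_done()

    async def main():
        for _ in range(NUM_WORKERS):
            IOLoop.current().spawn_callback(worker)
        with open("urls.txt") as f:
            for line in f:
                # put() waits while the queue is full, so the file is
                # read lazily and never held in memory all at once.
                await q.put(line.strip())
        await q.join()

    if __name__ == '__main__':
        IOLoop.current().run_sync(main)

Either way, the key point is that q.put() blocks once the queue is full, which gives you backpressure: the URL source is consumed only as fast as the workers can drain it.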
