I am trying to write a web crawler and want to make HTTP requests as quickly as possible. Tornado's AsyncHTTPClient seems like a good choice, but every code sample I have seen (e.g. https://stackoverflow.com/a/125/9203/... ) basically calls AsyncHTTPClient.fetch over a huge list of URLs, letting Tornado queue them all up and eventually make the requests.
But what if I want to process an indefinitely long (or simply very large) list of URLs coming from a file or over the network? I do not want to load all of the URLs into memory.
I searched Google but could not find a way to feed AsyncHTTPClient.fetch from an iterator. I did, however, find a way to do what I want using gevent: http://gevent.org/gevent.threadpool.html#gevent.threadpool.ThreadPool.imap . Is there a way to do something similar in Tornado?
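For reference, the gevent pattern I mean looks roughly like this. It is only an illustration: `url_iterator`, the pool size of 20, and the use of `requests` as the blocking HTTP library are all placeholders, not part of any real code I have.

```python
from gevent.threadpool import ThreadPool
import requests  # placeholder blocking HTTP library

def url_iterator():
    # hypothetical lazy source of URLs, e.g. one URL per line of a file
    with open("urls.txt") as f:
        for line in f:
            yield line.strip()

def fetch(url):
    return url, requests.get(url).status_code

pool = ThreadPool(20)
# imap pulls URLs from the iterator lazily, so the full list is never in memory
for url, status in pool.imap(fetch, url_iterator()):
    print(url, status)
```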
One solution I was thinking about is to queue up only so many URLs at a time and enqueue more whenever a fetch completes, but I am hoping there is a cleaner solution.
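Here is a minimal sketch of that "queue up only so many URLs" idea, assuming Tornado 4.2+ (for tornado.queues); the file name urls.txt, the url_iterator helper, and the CONCURRENCY/BUFFER_SIZE values are placeholders I made up for illustration:

```python
from tornado import gen, ioloop, queues
from tornado.httpclient import AsyncHTTPClient

CONCURRENCY = 10   # simultaneous fetches (placeholder value)
BUFFER_SIZE = 100  # URLs held in memory at any time (placeholder value)

def url_iterator():
    # hypothetical lazy URL source: one URL per line of a file
    with open("urls.txt") as f:
        for line in f:
            yield line.strip()

@gen.coroutine
def main():
    q = queues.Queue(maxsize=BUFFER_SIZE)
    client = AsyncHTTPClient()

    @gen.coroutine
    def producer():
        # q.put() yields once the queue is full, so at most BUFFER_SIZE
        # URLs are buffered in memory at any moment
        for url in url_iterator():
            yield q.put(url)
        for _ in range(CONCURRENCY):
            yield q.put(None)  # one stop marker per worker

    @gen.coroutine
    def worker():
        while True:
            url = yield q.get()
            if url is None:
                return
            response = yield client.fetch(url, raise_error=False)
            print(url, response.code)

    # run the producer and the workers concurrently until everything is fetched
    yield [producer()] + [worker() for _ in range(CONCURRENCY)]

if __name__ == "__main__":
    ioloop.IOLoop.current().run_sync(main)
```

This works, but it feels like a lot of plumbing for something gevent's imap does in one call, which is why I am asking whether Tornado has a more direct equivalent.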
Any help or recommendations would be appreciated!