I used some proxies to crawl a website. Here is what I did in settings.py:
    # Retry many times since proxies often fail
    RETRY_TIMES = 10
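I also set RETRY_HTTP_CODES in the same file (mentioned again below); the exact values here are just an illustration, not my real list:

    RETRY_ENABLED = True                          # default value, shown only for completeness
    RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # illustrative list of status codes to retry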
I also have a proxy middleware with the following methods:
    def process_request(self, request, spider):
        log('Requesting url %s with proxy %s...' % (request.url, proxy))

    def process_response(self, request, response, spider):
        log('Response received from request url %s with proxy %s'
            % (request.url, proxy if proxy else 'nil'))

    def process_exception(self, request, exception, spider):
        log_msg('Failed to request url %s with proxy %s with exception %s'
                % (request.url, proxy if proxy else 'nil', str(exception)))
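The snippet above only shows the logging; in the real middleware the proxy is picked and attached in process_request, roughly like the sketch below. The proxy pool and selection logic here are simplified placeholders, not my actual code:

    import random

    class ProxyMiddleware(object):
        # Placeholder proxy pool; the real list is loaded from elsewhere.
        PROXIES = ['http://10.0.0.1:8080', 'http://10.0.0.2:8080']

        def process_request(self, request, spider):
            proxy = random.choice(self.PROXIES)
            # Attach the proxy so the downloader actually uses it for this request.
            request.meta['proxy'] = proxy
            spider.logger.debug('Requesting url %s with proxy %s...', request.url, proxy)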
Since the proxy servers are sometimes unstable, process_exception logs a lot of request failures. The problem is that the failed requests are never retried.
As shown above, I have set the RETRY_TIMES and RETRY_HTTP_CODES settings, and I also return a retry request from the process_exception method of the proxy middleware.
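To be concrete, the retry request I return from process_exception is built roughly like this (a simplified sketch: the proxy bookkeeping is stripped out, the logging helper is replaced by spider.logger, and 'retry_times' is the meta key Scrapy's built-in RetryMiddleware also uses):

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('proxy')
        # Same failure log as above (log_msg helper omitted here).
        spider.logger.info('Failed to request url %s with proxy %s with exception %s',
                           request.url, proxy if proxy else 'nil', str(exception))

        retries = request.meta.get('retry_times', 0) + 1
        max_retries = spider.crawler.settings.getint('RETRY_TIMES')
        if retries <= max_retries:
            retry_req = request.copy()
            retry_req.meta['retry_times'] = retries
            retry_req.dont_filter = True   # skip the dupe filter so the copy is not dropped
            return retry_req               # returning a Request should re-schedule it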
Why does Scrapy never retry a failed request, or how can I make sure a request is attempted at least RETRY_TIMES times, as set in settings.py?