Python Scrapy does not retry on connection timeout

I used some proxies to crawl a website. Here is what I did in settings.py:

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOAD_DELAY = 3  # 3 seconds of delay

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'myspider.comm.rotate_useragent.RotateUserAgentMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 200,
    'myspider.comm.random_proxy.RandomProxyMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 400,
}

And I also have proxy middleware that has the following methods:

def process_request(self, request, spider):
    log('Requesting url %s with proxy %s...' % (request.url, proxy))

def process_response(self, request, response, spider):
    log('Response received from request url %s with proxy %s'
        % (request.url, proxy if proxy else 'nil'))

def process_exception(self, request, exception, spider):
    log_msg('Failed to request url %s with proxy %s with exception %s'
            % (request.url, proxy if proxy else 'nil', str(exception)))
    # retry again
    return request
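For reference, below is a minimal, self-contained sketch of what such a rotating-proxy middleware could look like, with the retry bookkeeping done explicitly in process_exception. The PROXY_LIST setting name, the proxy_retry_times meta key and the logging calls are illustrative assumptions, not the asker's actual code:

import logging
import random

logger = logging.getLogger(__name__)

class RandomProxyMiddleware(object):
    """Illustrative sketch of a rotating-proxy downloader middleware."""

    def __init__(self, proxies, max_retries):
        self.proxies = proxies            # e.g. ['http://1.2.3.4:8080', ...]
        self.max_retries = max_retries

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed setting; RETRY_TIMES comes from settings.py.
        return cls(
            proxies=crawler.settings.getlist('PROXY_LIST'),
            max_retries=crawler.settings.getint('RETRY_TIMES', 10),
        )

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy     # HttpProxyMiddleware picks this up
        logger.debug('Requesting url %s with proxy %s...', request.url, proxy)

    def process_response(self, request, response, spider):
        logger.debug('Response received from request url %s with proxy %s',
                     request.url, request.meta.get('proxy', 'nil'))
        return response                   # always pass the response along

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('proxy', 'nil')
        retries = request.meta.get('proxy_retry_times', 0) + 1
        logger.warning('Failed to request url %s with proxy %s (%s), attempt %d',
                       request.url, proxy, exception, retries)
        if retries <= self.max_retries:
            retry_req = request.replace(dont_filter=True)  # bypass the dupe filter
            retry_req.meta['proxy_retry_times'] = retries
            return retry_req
        # Give up: returning None lets the remaining middlewares see the exception.

The manual counter matters because, per the Scrapy docs, a Request returned from process_exception() is simply rescheduled and stops the remaining process_exception() handlers (including RetryMiddleware's) from running, so nothing else is counting the attempts.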

Since the proxy servers are sometimes not very stable, process_exception logs a lot of request failure messages. The problem is that the failed request is never retried.

As shown above, I set the RETRY_TIMES and RETRY_HTTP_CODES settings, and I also return the request again from the process_exception method of the proxy middleware.

Why does Scrapy never retry the failed request, and how can I make sure the request is attempted at least RETRY_TIMES times, as set in settings.py?
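For readers on modern Scrapy (2.5 and later): this bookkeeping can be delegated to the built-in get_retry_request() helper, which honours RETRY_TIMES and the retry stats. It did not exist in the scrapy.contrib-era release used in this question, so the following is only a sketch for newer code bases:

from scrapy.downloadermiddlewares.retry import get_retry_request

def process_exception(self, request, exception, spider):
    # Returns a fresh request with the retry counter updated and stats recorded,
    # or None once the RETRY_TIMES budget is exhausted.
    return get_retry_request(request, spider=spider, reason=str(exception))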

+7
python web-scraping scrapy screen-scraping
2 answers

Thanks to @nyov on the Scrapy IRC channel for the help.

'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 200,
'myspider.comm.random_proxy.RandomProxyMiddleware': 300,

Here the retry middleware runs first, so it retries the request before it ever reaches the proxy middleware. In my situation, Scrapy needs a proxy to crawl the site, otherwise it just times out endlessly.

So I swapped the priorities of these two downloader middlewares:

'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 300,
'myspider.comm.random_proxy.RandomProxyMiddleware': 200,
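
Putting it together, the corrected DOWNLOADER_MIDDLEWARES block in settings.py would then look like this (same component paths as in the question, only the two priorities swapped):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'myspider.comm.rotate_useragent.RotateUserAgentMiddleware': 100,
    'myspider.comm.random_proxy.RandomProxyMiddleware': 200,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 400,
}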

+5

It looks like your proxy middleware's process_response() doesn't play by the rules and therefore breaks the middleware chain.

process_response() should either: return a Response object, return a Request object, or raise an IgnoreRequest exception.

If it returns a Response (it could be the same response or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.

...
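
In other words, a process_response() that follows that contract always hands something back; a minimal sketch (the logger name and the proxy meta key are illustrative):

import logging

logger = logging.getLogger(__name__)

def process_response(self, request, response, spider):
    proxy = request.meta.get('proxy')
    logger.debug('Response received from request url %s with proxy %s',
                 request.url, proxy if proxy else 'nil')
    return response   # or return a new Request, or raise IgnoreRequest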

0