Python Scrapy does not retry on connection timeout

I used some proxies to crawl a website. Here is what I did in settings.py:

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOAD_DELAY = 3  # 3 seconds of delay

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'myspider.comm.rotate_useragent.RotateUserAgentMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 200,
    'myspider.comm.random_proxy.RandomProxyMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 400,
}

And I also have proxy middleware that has the following methods:

def process_request(self, request, spider):
    log('Requesting url %s with proxy %s...' % (request.url, proxy))

def process_response(self, request, response, spider):
    log('Response received from request url %s with proxy %s'
        % (request.url, proxy if proxy else 'nil'))

def process_exception(self, request, exception, spider):
    log_msg('Failed to request url %s with proxy %s with exception %s'
            % (request.url, proxy if proxy else 'nil', str(exception)))
    # retry again
    return request
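For reference, below is a minimal, self-contained sketch of what such a rotating-proxy middleware could look like, with the retry bookkeeping done explicitly in process_exception. The PROXY_LIST setting name, the proxy_retry_times meta key and the logging calls are illustrative assumptions, not the asker's actual code:

import logging
import random

logger = logging.getLogger(__name__)

class RandomProxyMiddleware(object):
    """Illustrative sketch of a rotating-proxy downloader middleware."""

    def __init__(self, proxies, max_retries):
        self.proxies = proxies            # e.g. ['http://1.2.3.4:8080', ...]
        self.max_retries = max_retries

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed setting; RETRY_TIMES comes from settings.py.
        return cls(
            proxies=crawler.settings.getlist('PROXY_LIST'),
            max_retries=crawler.settings.getint('RETRY_TIMES', 10),
        )

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy     # HttpProxyMiddleware picks this up
        logger.debug('Requesting url %s with proxy %s...', request.url, proxy)

    def process_response(self, request, response, spider):
        logger.debug('Response received from request url %s with proxy %s',
                     request.url, request.meta.get('proxy', 'nil'))
        return response                   # always pass the response along

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('proxy', 'nil')
        retries = request.meta.get('proxy_retry_times', 0) + 1
        logger.warning('Failed to request url %s with proxy %s (%s), attempt %d',
                       request.url, proxy, exception, retries)
        if retries <= self.max_retries:
            retry_req = request.replace(dont_filter=True)  # bypass the dupe filter
            retry_req.meta['proxy_retry_times'] = retries
            return retry_req
        # Give up: returning None lets the remaining middlewares see the exception.

The manual counter matters because, per the Scrapy docs, a Request returned from process_exception() is simply rescheduled and stops the remaining process_exception() handlers (including RetryMiddleware's) from running, so nothing else is counting the attempts.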

Since the proxy servers are sometimes not very stable, process_exception logs a lot of request failure messages. The problem is that the failed request is never retried.

As shown above, I set the RETRY_TIMES and RETRY_HTTP_CODES settings, and I also return the request again from the process_exception method of the proxy middleware.

Why does Scrapy never retry the failed request, and how can I make sure the request is attempted at least RETRY_TIMES times, as set in settings.py?
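For readers on modern Scrapy (2.5 and later): this bookkeeping can be delegated to the built-in get_retry_request() helper, which honours RETRY_TIMES and the retry stats. It did not exist in the scrapy.contrib-era release used in this question, so the following is only a sketch for newer code bases:

from scrapy.downloadermiddlewares.retry import get_retry_request

def process_exception(self, request, exception, spider):
    # Returns a fresh request with the retry counter updated and stats recorded,
    # or None once the RETRY_TIMES budget is exhausted.
    return get_retry_request(request, spider=spider, reason=str(exception))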

+7
python web-scraping scrapy screen-scraping
2 answers

Thanks to @nyov on the Scrapy IRC channel for the help.

'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 200,
'myspider.comm.random_proxy.RandomProxyMiddleware': 300,

Here the retry middleware runs first, so it retries the request before it ever reaches the proxy middleware. In my situation, Scrapy needs a proxy to crawl the site, otherwise it just times out endlessly.

So I swapped the priorities of these two downloader middlewares:

'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 300,
'myspider.comm.random_proxy.RandomProxyMiddleware': 200,
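
Putting it together, the corrected DOWNLOADER_MIDDLEWARES block in settings.py would then look like this (same component paths as in the question, only the two priorities swapped):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'myspider.comm.rotate_useragent.RotateUserAgentMiddleware': 100,
    'myspider.comm.random_proxy.RandomProxyMiddleware': 200,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 400,
}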

+5

It looks like your proxy middleware's process_response() doesn't play by the rules and therefore breaks the middleware chain.

process_response() should either: return a Response object, return a Request object, or raise an IgnoreRequest exception.

If it returns a Response (it could be the same response or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.

...
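
In other words, a process_response() that follows that contract always hands something back; a minimal sketch (the logger name and the proxy meta key are illustrative):

import logging

logger = logging.getLogger(__name__)

def process_response(self, request, response, spider):
    proxy = request.meta.get('proxy')
    logger.debug('Response received from request url %s with proxy %s',
                 request.url, proxy if proxy else 'nil')
    return response   # or return a new Request, or raise IgnoreRequest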

0