Scrapy: limiting the number of requests, bytes, or seconds

I am using a scrapy CrawlSpider and have defined a Twisted reactor to run my crawl. During testing, I crawled a news site and collected more than a few GB of data. I am mostly interested in the latest stories, so I am looking for a way to limit the number of requested pages, bytes, or seconds.

Is there a general way to set a limit on:

  • request_bytes
  • request_counts or
  • execution time in seconds?
1 Answer

There is a CloseSpider extension class at scrapy.extensions.closespider.CloseSpider. You can set the CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT, and CLOSESPIDER_ERRORCOUNT settings.

The spider closes automatically as soon as any of these criteria is met: http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.extensions.closespider
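For example, these limits can be set per spider via the custom_settings class attribute. A minimal sketch, assuming a CrawlSpider like the one in the question; the spider name, start URL, threshold values, and parse logic are placeholders, not from the original post:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class NewsSpider(CrawlSpider):
        name = "news"                               # hypothetical spider name
        start_urls = ["https://example.com/news"]   # placeholder URL

        rules = (
            # Follow all links and pass each response to parse_item.
            Rule(LinkExtractor(), callback="parse_item", follow=True),
        )

        # The spider closes as soon as any one of these thresholds is hit:
        custom_settings = {
            "CLOSESPIDER_PAGECOUNT": 1000,  # max responses to download
            "CLOSESPIDER_TIMEOUT": 600,     # max seconds the spider may run
            "CLOSESPIDER_ITEMCOUNT": 500,   # max items scraped
        }

        def parse_item(self, response):
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }

The same keys can also be set project-wide in settings.py. Note that there is no byte-based counterpart among the CLOSESPIDER_* settings, so page count and timeout are the closest built-in options for the limits asked about.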

