Scrapy: limiting the number of requests, bytes, or seconds

I am using a scrapy CrawlSpider and have defined a Twisted reactor to run my crawl. During testing, I crawled a news site and collected more than a few GB of data. I am mostly interested in the latest stories, so I am looking for a way to limit the number of requested pages, bytes, or seconds.

Is there a general way to set a limit on:

  • request_bytes
  • request_counts or
  • execution time in seconds?
1 Answer

There is a CloseSpider extension class at scrapy.extensions.closespider.CloseSpider. You can set the CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT, and CLOSESPIDER_ERRORCOUNT settings.

The spider closes automatically as soon as any of these criteria is met: http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.extensions.closespider
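For example, these limits can be set per spider via the custom_settings class attribute. A minimal sketch, assuming a CrawlSpider like the one in the question; the spider name, start URL, threshold values, and parse logic are placeholders, not from the original post:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class NewsSpider(CrawlSpider):
        name = "news"                               # hypothetical spider name
        start_urls = ["https://example.com/news"]   # placeholder URL

        rules = (
            # Follow all links and pass each response to parse_item.
            Rule(LinkExtractor(), callback="parse_item", follow=True),
        )

        # The spider closes as soon as any one of these thresholds is hit:
        custom_settings = {
            "CLOSESPIDER_PAGECOUNT": 1000,  # max responses to download
            "CLOSESPIDER_TIMEOUT": 600,     # max seconds the spider may run
            "CLOSESPIDER_ITEMCOUNT": 500,   # max items scraped
        }

        def parse_item(self, response):
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }

The same keys can also be set project-wide in settings.py. Note that there is no byte-based counterpart among the CLOSESPIDER_* settings, so page count and timeout are the closest built-in options for the limits asked about.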

