I am using a Scrapy CrawlSpider and run it from a script with a Twisted reactor. During testing I crawled a news site and collected more than a few GB of data. Since I'm mostly interested in the latest stories, I'm looking for a way to limit the crawl by number of requested pages, downloaded bytes, or elapsed seconds.
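
For reference, here is roughly what my setup looks like; the spider name, start URL, and link pattern are simplified placeholders rather than my real ones:

```python
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NewsSpider(CrawlSpider):
    name = "news"  # placeholder name
    start_urls = ["https://example.com/news"]  # placeholder start URL
    rules = (
        # follow article links and hand each response to parse_story
        Rule(LinkExtractor(allow=r"/story/"), callback="parse_story", follow=True),
    )

    def parse_story(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}


runner = CrawlerRunner()
deferred = runner.crawl(NewsSpider)
deferred.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl ends
reactor.run()
```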
Is there a general way to set a limit on any of the following? (A rough sketch of what I have in mind follows the list.)

- request bytes
- request count
- execution time in seconds
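
To make the question concrete, something along these lines is what I'm hoping for. The CLOSESPIDER_* settings below are my best guess at the right mechanism; whether they are, and whether a byte-based equivalent exists, is exactly what I'm asking:

```python
from scrapy.crawler import CrawlerRunner

# The numbers are illustrative values, not recommendations.
runner = CrawlerRunner(settings={
    "CLOSESPIDER_PAGECOUNT": 1000,  # stop after roughly 1000 crawled responses
    "CLOSESPIDER_TIMEOUT": 3600,    # stop after 3600 seconds of crawling
    # is there an analogous limit for total downloaded bytes?
})
```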