I am trying to carry some information over from the previous spider run to the current one. To do this, I found the Stats Collection facility supported by Scrapy. My code is below:
```python
from scrapy import Spider, Request


class StatsSpider(Spider):
    name = 'stats'

    def __init__(self, crawler, *args, **kwargs):
        Spider.__init__(self, *args, **kwargs)
        self.crawler = crawler
        print self.crawler.stats.get_value('last_visited_url')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start_requests(self):
        return [Request(url) for url in
                ['http://www.google.com', 'http://www.yahoo.com']]

    def parse(self, response):
        self.crawler.stats.set_value('last_visited_url', response.url)
        print 'URL: %s' % response.url
```
When I launch my spider, I can see in the debugger that the stats variable is updated with new data. However, when I launch the spider again (locally), the stats variable starts out empty. How should I run my spider so that the data is preserved between runs?
I run it on the console:
scrapy runspider stats.py
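For context, Scrapy's default stats collector (`MemoryStatsCollector`) keeps its data in memory only, so everything is discarded when the crawl process exits — which is why the second run starts empty. One way to bridge runs locally is to dump the values you care about to a file when the spider closes and read them back on the next start. A minimal sketch of that idea (the file name `spider_stats.json` and the helper names are my own, not part of Scrapy):

```python
import json
import os

STATS_FILE = 'spider_stats.json'  # hypothetical local path


def load_previous_stats(path=STATS_FILE):
    """Return the stats dict saved by the previous run, or {} on first run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def save_stats(stats, path=STATS_FILE):
    """Persist a dict of stats to disk so the next run can read it."""
    with open(path, 'w') as f:
        json.dump(stats, f)
```

In the spider you could then call `load_previous_stats()` in `__init__` and, for example, override `closed(self, reason)` to call `save_stats({'last_visited_url': self.crawler.stats.get_value('last_visited_url')})` when the crawl finishes.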
EDIT: If you run your spiders on Scrapinghub, you can use their Collections API instead.