Scrapy spider - persisting data between runs via the Stats Collection

I am trying to persist some information from the previous spider run so it is available in the current one. To make this possible, I found the Stats Collection feature provided by Scrapy. My code is below:

    from scrapy import Request, Spider


    class StatsSpider(Spider):
        name = 'stats'

        def __init__(self, crawler, *args, **kwargs):
            Spider.__init__(self, *args, **kwargs)
            self.crawler = crawler
            print self.crawler.stats.get_value('last_visited_url')

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def start_requests(self):
            return [Request(url) for url in
                    ['http://www.google.com', 'http://www.yahoo.com']]

        def parse(self, response):
            self.crawler.stats.set_value('last_visited_url', response.url)
            print 'URL: %s' % response.url

When I launch my spider, I can see in the debug output that the stats value is updated with new data. However, when I launch the spider again (locally), the stats value starts out empty. How should I run my spider so that the data is actually saved?

I run it on the console:

 scrapy runspider stats.py 

EDIT: If you run your spider on Scrapinghub, you can use their Collections API.
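
For reference, a rough sketch of reading and writing a value through the python-scrapinghub client might look like this; the API key, project ID and collection name are placeholders, and the exact calls should be checked against the current python-scrapinghub documentation:

    # Sketch only: APIKEY, 123456 and 'spider_state' are placeholders,
    # not values from the original post.
    from scrapinghub import ScrapinghubClient

    client = ScrapinghubClient('APIKEY')
    project = client.get_project(123456)
    store = project.collections.get_store('spider_state')

    # Write the value at the end of a run...
    store.set({'_key': 'last_visited_url', 'value': 'http://www.yahoo.com'})

    # ...and read it back at the start of the next run.
    item = store.get('last_visited_url')
    print item['value']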

1 answer

You need to persist this data to disk yourself (in a file or a database).

The crawler object you are writing the data to exists only for the duration of your crawl. As soon as your spider finishes, that object leaves memory and your data is gone.

I suggest loading the statistics from your last run in __init__, updating them in parse just as you are doing now, and then connecting the spider_closed Scrapy signal so the data is saved when the spider finishes.

If you need a spider_closed example, let me know and I will update this answer, but there are plenty of examples available online.

Edit: here is an example: fooobar.com/questions/200693/...
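
As a rough illustration of the suggestion above, a minimal sketch might look like the following. The stats.json file name and the spider name are assumptions made for this example, not part of the original question:

    # Sketch only: stats.json is an assumed file name for the persisted data.
    import json
    import os

    from scrapy import Request, Spider, signals


    class PersistentStatsSpider(Spider):
        name = 'persistent_stats'
        stats_file = 'stats.json'  # assumed location for the persisted data

        def __init__(self, *args, **kwargs):
            super(PersistentStatsSpider, self).__init__(*args, **kwargs)
            # Load the data saved by the previous run, if any.
            if os.path.exists(self.stats_file):
                with open(self.stats_file) as f:
                    self.persistent_stats = json.load(f)
            else:
                self.persistent_stats = {}
            print 'last_visited_url from previous run: %s' % (
                self.persistent_stats.get('last_visited_url'))

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(PersistentStatsSpider, cls).from_crawler(
                crawler, *args, **kwargs)
            # Write the data back to disk when the spider finishes.
            crawler.signals.connect(spider.spider_closed,
                                    signal=signals.spider_closed)
            return spider

        def start_requests(self):
            return [Request(url) for url in
                    ['http://www.google.com', 'http://www.yahoo.com']]

        def parse(self, response):
            self.persistent_stats['last_visited_url'] = response.url
            print 'URL: %s' % response.url

        def spider_closed(self, spider):
            with open(self.stats_file, 'w') as f:
                json.dump(self.persistent_stats, f)

Running it twice with scrapy runspider should then print the last URL from the first run at the start of the second.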

