Scrapy Spider: restart the spider when done

I am trying to relaunch my Scrapy spider when it closes because of a problem with my Internet connection (at night the connection goes down for about 5 minutes). When the connection goes down, the spider closes after 5 attempts.

I am using this function inside my spider definition to try to restart the spider when it closes:

    def handle_spider_closed(spider, reason):
        relaunch = False
        for key in spider.crawler.stats._stats.keys():
            if 'DNSLookupError' in key:
                relaunch = True
                break
        if relaunch:
            spider = mySpider()
            settings = get_project_settings()
            crawlerProcess = CrawlerProcess(settings)
            crawlerProcess.configure()
            crawlerProcess.crawl(spider)
            spider.crawler.queue.append_spider(another_spider)
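
The DNS check in that function can be isolated and tried without running a crawl. This is a minimal sketch, assuming the stats are read as a plain dict (for example via the public `spider.crawler.stats.get_stats()` rather than the private `_stats` attribute). The helper name `saw_dns_failure` and the example values are hypothetical.

```python
def saw_dns_failure(stats):
    """Return True if any stat key mentions DNSLookupError."""
    return any("DNSLookupError" in str(key) for key in stats)


# Example dict shaped like the stats Scrapy collects; the values are made up.
example_stats = {
    "downloader/request_count": 6,
    "downloader/exception_type_count/twisted.internet.error.DNSLookupError": 6,
    "finish_reason": "finished",
}

print(saw_dns_failure(example_stats))
```

Matching on the key text keeps the check independent of which component recorded the error, at the cost of also matching any unrelated key that happens to contain the same name.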

I tried a lot of things, such as creating a new instance of the spider, but got an error saying something like the reactor is already running.

I thought about running the spider from a script and running it again when it finishes, but that doesn't work because the reactor is still in use.
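
One common way around the reactor limitation is to give every run its own process, so each crawl starts with a fresh reactor. The following is a rough sketch under assumptions, not a tested recipe for this project: `crawl_in_subprocess` and `relaunch_loop` are hypothetical helper names, and how a run reports that it closed because of a lost connection is left to the `should_relaunch` callable that you supply.

```python
import subprocess
import sys
import time


def crawl_in_subprocess(spider_name):
    """Run `scrapy crawl <spider_name>` in a child process and return its exit code."""
    completed = subprocess.run([sys.executable, "-m", "scrapy", "crawl", spider_name])
    return completed.returncode


def relaunch_loop(run_once, should_relaunch, max_runs=5, delay=60):
    """Call run_once() until should_relaunch(result) is False or max_runs is reached.

    Returns the number of runs that were performed.
    """
    for attempt in range(1, max_runs + 1):
        result = run_once()
        if not should_relaunch(result):
            return attempt
        if attempt < max_runs:
            # Wait before the next run, e.g. to let the connection come back.
            time.sleep(delay)
    return max_runs
```

A call could look like `relaunch_loop(lambda: crawl_in_subprocess("myspider"), my_check)`, where `my_check` is whatever test fits your setup. The exit code alone may not say why the spider closed, so the check might instead read a marker that the spider writes out when it sees DNS errors.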

  • My intention is to restart the spider after it closes (it closes because it has lost its Internet connection)

Does anyone know a good and easy way to do this?

1 answer

I found a solution to my problem! What was I trying to do?

  • Handle the spider when it fails or closes
  • Try to run the spider again when it closes

I managed to deal with the spider error as follows:

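    import time

    import scrapy


    class mySpider(scrapy.Spider):
        name = "myspider"
        allowed_domains = ["google.com"]
        start_urls = [
            "http://www.google.com",
        ]

        def handle_error(self, failure):
            # Called when a request fails, for example on a DNS lookup error.
            self.log("Error Handle: %s" % failure.request)
            self.log("Sleeping 60 seconds")
            # Note: time.sleep blocks the whole crawler while it waits.
            time.sleep(60)
            # Retry; dont_filter=True lets the repeated URL past the duplicate filter.
            url = 'http://www.google.com'
            yield scrapy.Request(url, self.parse, errback=self.handle_error, dont_filter=True)

        def start_requests(self):
            url = 'http://www.google.com'
            yield scrapy.Request(url, self.parse, errback=self.handle_error)

        def parse(self, response):
            # Placeholder callback (not in the original post); real parsing goes here.
            self.log("Fetched: %s" % response.url)
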
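  • I used dont_filter=True so that the spider is allowed to repeat a request, which the duplicate filter would otherwise drop, and only when the request has gone through the error handler.
  • errback=self.handle_error makes the spider call the custom handle_error function when a request fails.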

Source: https://habr.com/ru/post/1215142/

