Delaying parts of a spider in Scrapy

I have the parse method below. It uses Selenium for the first page load, visits certain pages that the spider cannot reach directly, and collects item URLs, which it yields to another parse method that extracts elements from those pages. The problem is that this parse method blocks the other one until all pages have been visited, and that clogs the system. I tried adding a sleep, but that stops the whole engine, not just this parse method.

Any guidance on how I could optimize this, or at least make the sleep non-blocking so that it doesn't stall the engine?

    def parse(self, response):
        '''Parse first page and extract page links'''
        item_link_xpath = "/html/body/form/div[@class='wrapper']//a[@title='View & Apply']"
        pagination_xpath = "//div[@class='pagination']/input"
        page_xpath = pagination_xpath + "[@value=%d]"

        display = Display(visible=0, size=(800, 600))
        display.start()
        browser = webdriver.Firefox()
        browser.get(response.url)
        log.msg('Loaded search results', level=log.DEBUG)

        page_no = 1
        while True:
            log.msg('Scraping page: %d' % page_no, level=log.DEBUG)
            for link in [item_link.get_attribute('href')
                         for item_link in browser.find_elements_by_xpath(item_link_xpath)]:
                yield Request(link, callback=self.parse_item_page)

            page_no += 1
            log.msg('Using xpath: %s' % (page_xpath % page_no), level=log.DEBUG)
            page_element = browser.find_element_by_xpath(page_xpath % page_no)
            if not page_element or page_no > settings['PAGINATION_PAGES']:
                break
            page_element.click()

            if settings['PAGINATION_SLEEP_INTERVAL']:
                seconds = int(settings['PAGINATION_SLEEP_INTERVAL'])
                log.msg('Sleeping for %d' % seconds, level=log.DEBUG)
                time.sleep(seconds)

        log.msg('Scraped listing pages, closing browser.', level=log.DEBUG)
        browser.close()
        display.stop()
1 Answer

This might help:

    # delayspider.py
    from scrapy.spider import BaseSpider
    from twisted.internet import reactor, defer
    from scrapy.http import Request

    DELAY = 5  # seconds

    class MySpider(BaseSpider):
        name = 'wikipedia'
        max_concurrent_requests = 1
        start_urls = ['http://www.wikipedia.org']

        def parse(self, response):
            nextreq = Request('http://en.wikipedia.org')
            dfd = defer.Deferred()
            # Fire the Deferred with the next Request after DELAY seconds;
            # returning the Deferred (instead of the Request itself) lets
            # the engine keep running in the meantime.
            reactor.callLater(DELAY, dfd.callback, nextreq)
            return dfd

Output:

    $ scrapy runspider delayspider.py
    2012-05-24 11:01:54-0300 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled item pipelines:
    2012-05-24 11:01:54-0300 [wikipedia] INFO: Spider opened
    2012-05-24 11:01:54-0300 [wikipedia] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2012-05-24 11:01:54-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2012-05-24 11:01:56-0300 [wikipedia] DEBUG: Crawled (200) <GET http://www.wikipedia.org> (referer: None)
    2012-05-24 11:02:04-0300 [wikipedia] DEBUG: Redirecting (301) to <GET http://en.wikipedia.org/wiki/Main_Page> from <GET http://en.wikipedia.org>
    2012-05-24 11:02:06-0300 [wikipedia] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Main_Page> (referer: http://www.wikipedia.org)
    2012-05-24 11:02:11-0300 [wikipedia] INFO: Closing spider (finished)
    2012-05-24 11:02:11-0300 [wikipedia] INFO: Dumping spider stats:
        {'downloader/request_bytes': 745,
         'downloader/request_count': 3,
         'downloader/request_method_count/GET': 3,
         'downloader/response_bytes': 29304,
         'downloader/response_count': 3,
         'downloader/response_status_count/200': 2,
         'downloader/response_status_count/301': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2012, 5, 24, 14, 2, 11, 447498),
         'request_depth_max': 2,
         'scheduler/memory_enqueued': 3,
         'start_time': datetime.datetime(2012, 5, 24, 14, 1, 54, 408882)}
    2012-05-24 11:02:11-0300 [wikipedia] INFO: Spider closed (finished)
    2012-05-24 11:02:11-0300 [scrapy] INFO: Dumping global stats: {}

It uses Twisted's reactor.callLater to sleep without blocking: parse returns a Deferred instead of a Request, and the reactor fires that Deferred with the next Request once DELAY seconds have passed, so the engine stays free to process other work in the meantime. You can see the pause in the timestamps above; the second request only goes out several seconds after the first page is crawled.
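The same trick generalizes into a reusable non-blocking sleep. Below is a minimal standalone Twisted sketch of the pattern (the sleep helper and woke callback are illustrative names of my own, not a Scrapy or Twisted API):

    from twisted.internet import defer, reactor

    def sleep(seconds, result=None):
        # Non-blocking "sleep": return a Deferred that the reactor
        # fires with `result` after `seconds` have elapsed. Nothing
        # blocks, so other scheduled work keeps running meanwhile.
        d = defer.Deferred()
        reactor.callLater(seconds, d.callback, result)
        return d

    def woke(result):
        print(result)  # printed roughly 5 seconds after startup
        reactor.stop()

    sleep(5, 'woke up').addCallback(woke)
    reactor.run()

Applied to the spider in the question, the idea would be to return a Deferred built this way instead of calling time.sleep between pagination clicks. Note, though, that the answer's example returns the Deferred as the entire callback result; weaving delays into a generator-style parse like the one in the question would likely mean restructuring the pagination into chained callbacks.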
