Multiple Spider Inheritance

Is it possible to create a spider that inherits functionality from two base spiders, namely SitemapSpider and CrawlSpider?

I am trying to scrape data from different sites and realized that not all sites list every page in a sitemap, so you need to use CrawlSpider. But CrawlSpider crawls through many unwanted pages, which is a lot of overhead.

What I would like to do is something like this:

  • Run my spider, which is a subclass of SitemapSpider, and send responses whose URLs match a regular expression to parse_product to extract useful information.

  • Follow links matching the regular expression /reviews/ from the product page, and send those responses to a parse_review function.
    Note: "/reviews/"-type pages are not listed in the sitemap.

  • Extract information from the /reviews/ pages.

  • Use CrawlSpider only for the recursive crawling and scraping.
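For background on why combining the two base classes works: when a class inherits from two bases, Python resolves method lookups left to right along the method resolution order (MRO), so methods defined on the first base listed take precedence where both define the same name. A minimal sketch with hypothetical stand-in classes (not the real Scrapy ones):

```python
# Stand-in classes illustrating Python's MRO; these are hypothetical
# placeholders, not the real Scrapy SitemapSpider/CrawlSpider.
class BaseSpiderStub:
    def start_requests(self):
        return "base"

class SitemapSpiderStub(BaseSpiderStub):
    def start_requests(self):
        return "sitemap"

class CrawlSpiderStub(BaseSpiderStub):
    pass

class WebCrawlerStub(SitemapSpiderStub, CrawlSpiderStub):
    pass

# The first base class listed wins when both could supply a method:
assert WebCrawlerStub().start_requests() == "sitemap"
assert [c.__name__ for c in WebCrawlerStub.__mro__] == [
    'WebCrawlerStub', 'SitemapSpiderStub', 'CrawlSpiderStub',
    'BaseSpiderStub', 'object',
]
```

This is why listing SitemapSpider first in the class definition below makes the spider start from the sitemap rather than from CrawlSpider's default behaviour.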

------- ADDITIONAL DATA -------

The site in question is www.flipkart.com. The site has listings for a large number of products, and each product has its own details page. Alongside the details page there is a corresponding review page for the product, and a link to the review page is available on the product details page.

Note: review pages are not listed in the sitemap.

```python
from scrapy.spiders import SitemapSpider, CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class WebCrawler(SitemapSpider, CrawlSpider):
    name = "flipkart"
    allowed_domains = ['flipkart.com']
    sitemap_urls = ['http://www.flipkart.com/robots.txt']
    # sitemap_rules takes plain regex strings, not compiled objects
    sitemap_rules = [('/(.*?)/p/(.*?)', 'parse_product')]
    start_urls = ['http://www.flipkart.com/']
    rules = [
        Rule(LinkExtractor(allow=['/(.*?)/product-reviews/(.*?)']),
             'parse_reviews'),
        Rule(LinkExtractor(restrict_xpaths='//div[@class="fk-navigation fk-text-center tmargin10"]'),
             follow=True),
    ]

    def parse_product(self, response):
        loader = FlipkartItemLoader(response=response)
        loader.add_value('pid', 'value of pid')
        loader.add_xpath('name', 'xpath to name')
        yield loader.load_item()

    def parse_reviews(self, response):
        loader = ReviewItemLoader(response=response)
        loader.add_value('pid', 'value of pid')
        loader.add_xpath('review_title', 'xpath to review title')
        loader.add_xpath('review_text', 'xpath to review text')
        yield loader.load_item()
```
1 answer

You are on the right track; the only thing missing is that, at the end of your parse_product function, you also have to yield all the URLs the crawler extracted, like this:

```python
def parse_product(self, response):
    loader = FlipkartItemLoader(response=response)
    loader.add_value('pid', 'value of pid')
    loader.add_xpath('name', 'xpath to name')
    yield loader.load_item()
    # CrawlSpider's parse() applies the rules and yields
    # the requests for all extracted URLs.
    yield from self.parse(response)
```

If you don't have the `yield from` syntax (Python < 3.3), just use:

```python
for req in self.parse(response):
    yield req
```
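The two forms are equivalent: `yield from` delegates to the sub-generator and re-yields each of its items in order. A small self-contained illustration with generic generators (not Scrapy code):

```python
def inner():
    # Stand-in for self.parse(response): any generator of items.
    yield 1
    yield 2

def with_yield_from():
    yield 0
    # Delegates to inner(), re-yielding each of its items.
    yield from inner()

def with_loop():
    yield 0
    # The pre-3.3 spelling of the same delegation.
    for item in inner():
        yield item

assert list(with_yield_from()) == list(with_loop()) == [0, 1, 2]
```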

Source: https://habr.com/ru/post/1214433/
