Is it possible to create a spider that inherits functionality from two base spiders, namely SitemapSpider and CrawlSpider?
I am trying to scrape data from different sites and realized that not all sites list every page in their sitemap, so CrawlSpider is needed. But CrawlSpider goes through many unwanted pages and is overkill.
What I would like to do is something like this:
1. Start my spider, which is a subclass of SitemapSpider, and pass the responses whose URLs match a regular expression to parse_product to extract the useful information.
2. From the product page, follow the links matching the regular expression /reviews/ and send the responses to the parse_reviews function.
   Note: "/reviews/" type pages are not listed in the sitemap.
3. Extract information from the /reviews/ page.

CrawlSpider is mainly meant for recursive crawling and scraping.
------- ADDITIONAL DATA -------
The site in question is www.flipkart.com. The site lists a large number of products, each with its own details page. Along with the details page, there is a corresponding review page for the product. A link to the review page is also available on the product details page.
Note: Review pages are not listed in the sitemap.
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule, SitemapSpider

    # FlipkartItemLoader and ReviewItemLoader are item loaders defined elsewhere in the project.


    class WebCrawler(SitemapSpider, CrawlSpider):
        name = "flipkart"
        allowed_domains = ['flipkart.com']

        # SitemapSpider part: product pages discovered through the sitemap.
        sitemap_urls = ['http://www.flipkart.com/robots.txt']
        sitemap_rules = [(r'/(.*?)/p/(.*?)', 'parse_product')]

        # CrawlSpider part: review pages reached by following links.
        start_urls = ['http://www.flipkart.com/']
        rules = [
            Rule(LinkExtractor(allow=['/(.*?)/product-reviews/(.*?)']), 'parse_reviews'),
            Rule(LinkExtractor(restrict_xpaths='//div[@class="fk-navigation fk-text-center tmargin10"]'), follow=True),
        ]

        def parse_product(self, response):
            loader = FlipkartItemLoader(response=response)
            loader.add_value('pid', 'value of pid')
            loader.add_xpath('name', 'xpath to name')
            yield loader.load_item()

        def parse_reviews(self, response):
            loader = ReviewItemLoader(response=response)
            loader.add_value('pid', 'value of pid')
            loader.add_xpath('review_title', 'xpath to review title')
            loader.add_xpath('review_text', 'xpath to review text')
            yield loader.load_item()
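To make the second step concrete: following review links from a product page comes down to collecting the links on that page and keeping those that match the review pattern. Below is a minimal sketch of that filtering using only the Python standard library, independent of Scrapy. The pattern is the one from the allow rule in the code above; the helper name review_links, the sample markup, and the sample URLs are hypothetical and only serve as an illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Pattern taken from the LinkExtractor rule in the spider code.
REVIEW_LINK = re.compile(r'/(.*?)/product-reviews/(.*?)')


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def review_links(html, base_url):
    """Return absolute, de-duplicated URLs of links that look like review pages."""
    collector = _HrefCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        if REVIEW_LINK.search(url) and url not in urls:
            urls.append(url)
    return urls


# Hypothetical product page markup and URL, for illustration only.
sample_html = '''
<a href="/some-phone/product-reviews/ITEM123">All reviews</a>
<a href="/some-phone/p/ITEM123">Details</a>
<a href="/help">Help</a>
'''
print(review_links(sample_html, 'http://www.flipkart.com/some-phone/p/ITEM123'))
```

In a Scrapy spider, one possible approach would be to do the same filtering inside parse_product and yield a scrapy.Request with callback=self.parse_reviews for each matching URL. Under that assumption the spider could subclass SitemapSpider alone, without CrawlSpider's rules.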