Confused about the rules in Scrapy's CrawlSpider

I have a doubt about Scrapy's CrawlSpider. Assume this code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    class MySpider(CrawlSpider):
        name = 'myspider'
        allowed_domains = ['domain.com']
        start_urls = ['http://www.domain.com/foo/']
        rules = (
            Rule(SgmlLinkExtractor(allow=r'-\w+\.html$'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)

I want to know whether the spider first goes to the start URL, parses that page, and then uses the rules to extract the links,

or whether the spider does not analyze the first page at all and starts directly with the rules.

I have seen that if my rules do not match anything, I get no results, but shouldn't the spider at least parse the start page?

1 answer

I worked through Michael Herman's sample code, https://github.com/mjhea0/Scrapy-Samples , which starts with a BaseSpider example and moves on to a CrawlSpider example. The first example worked fine, but the second one did not scrape the first page, only the pages that followed, and I could not figure out what I was doing wrong. However, when I ran the code from GitHub, I realized that his code does not scrape the first page either! I believe this has to do with the different intentions behind CrawlSpider and BaseSpider, and after a little research I came up with the following:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from craigslist.items import CraigslistItem

    class MySpider(CrawlSpider):
        name = "CraigslistSpider"
        allowed_domains = ["craigslist.org"]
        start_urls = ["http://annapolis.craigslist.org/sof/"]
        rules = (
            Rule(SgmlLinkExtractor(allow=(r"index\d00\.html",),
                                   restrict_xpaths=('//p[@id="nextpage"]',)),
                 callback="parse_items", follow=True),
        )

        # Need to scrape the first page too, so we hack it by creating a
        # request for the start URL and sending it to the parse_items callback
        def parse_start_url(self, response):
            print('**********************')
            request = Request("http://annapolis.craigslist.org/sof/", callback=self.parse_items)
            return request

        def parse_items(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select("//p")
            items = []
            for title in titles:
                item = CraigslistItem()
                item["title"] = title.select("a/text()").extract()
                item["link"] = title.select("a/@href").extract()
                items.append(item)
            return items

In my case, since I used CrawlSpider, I had to implement parse_start_url and build a Request for the same URL found in start_urls, i.e. the first page. After that, the first page got scraped as well. BTW, I have only been using Scrapy and Python for 3 days!
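As a side note, a simpler variant of the same idea is possible, since CrawlSpider calls parse_start_url with the response of every URL in start_urls: instead of issuing a second Request for the same page, you can hand that response straight to the item callback. This is only a sketch, assuming the spider above:

    # Sketch: delegate the start page's response directly to parse_items,
    # avoiding a second request for the same URL.
    def parse_start_url(self, response):
        return self.parse_items(response)

The list of items returned by parse_items is then handled the same way as items returned from the rule callbacks.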
