I was working through a sample tutorial by Michael Herman, https://github.com/mjhea0/Scrapy-Samples , which starts with a BaseSpider example and then moves on to a CrawlSpider example. The first example worked fine, but the second example did not scrape the first page - only the second page onward - and I couldn't figure out what I was doing wrong. However, when I ran the code straight from GitHub, I realized that his code also doesn't scrape the first page! I suspected this had something to do with the different intent of CrawlSpider vs. BaseSpider, and after a little research I came up with the following:
```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist.items import CraigslistItem
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = "CraigslistSpider"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://annapolis.craigslist.org/sof/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html",),
                               restrict_xpaths=('//p[@id="nextpage"]',)),
             callback="parse_items", follow=True),
    )
```
In my case, since I was using CrawlSpider, the fix was to implement "parse_start_url" so that the URLs in start_urls, i.e. the first page, get handed to my item callback as well. After that, the first page was scraped too. BTW, I've only been using Scrapy and Python for 3 days!
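To see why this happens, here is a schematic, pure-Python model of CrawlSpider's dispatch logic (this is NOT Scrapy's actual internals, just an illustration; the class and method names other than parse_start_url/parse_items are made up): the Rule callbacks only ever run on links *extracted from* a response, while the start URLs themselves are routed to parse_start_url, whose default implementation yields nothing - so the first page is silently skipped unless you override it.

```python
class MiniCrawlSpider:
    # Hypothetical toy model of CrawlSpider dispatch, not real Scrapy code.
    # A "response" here is just a dict: {"url": ..., "links": [...]}.
    start_urls = ["http://annapolis.craigslist.org/sof/"]

    def crawl(self, response):
        # The start URL's response goes to parse_start_url, NOT to the
        # rule callback...
        results = list(self.parse_start_url(response))
        # ...while the rule callback (parse_items) only runs on links
        # extracted from the page.
        for link in response["links"]:
            results.extend(self.parse_items({"url": link, "links": []}))
        return results

    def parse_start_url(self, response):
        # Default behavior: yield nothing, so the first page is skipped.
        return iter(())

    def parse_items(self, response):
        yield {"scraped": response["url"]}


class FixedSpider(MiniCrawlSpider):
    def parse_start_url(self, response):
        # The fix: delegate the first page to the same item callback.
        return self.parse_items(response)
```

With a fake first page that links to index100.html, MiniCrawlSpider().crawl(page) yields an item only for the linked page, while FixedSpider().crawl(page) yields items for both the start URL and the linked page. In real Scrapy the override is analogous: define parse_start_url on your CrawlSpider and return self.parse_items(response) (or a Request for the same URL with that callback).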