How to use scrapy to scan multiple pages?

All the examples I found for Scrapy show how to crawl a single page, pages with the same URL pattern, or all the pages of a website. I need to crawl a series of pages A, B, C, where in A you get the link to B, and so on. For example, the structure of the website:

A ----> B ---------> C D E 

I need to crawl all the C pages, but to get the links to C, I first need to crawl A and B. Any hints?

2 answers

See the Scrapy Request structure. To crawl such a chain, you will have to use the callback parameter, like this:

    class MySpider(BaseSpider):
        ...
        # spider starts here
        def parse(self, response):
            ...
            # A, D, E are done in parallel, A -> B -> C are done serially
            yield Request(url=<A url>, ..., callback=self.parseA)
            yield Request(url=<D url>, ..., callback=self.parseD)
            yield Request(url=<E url>, ..., callback=self.parseE)

        def parseA(self, response):
            ...
            yield Request(url=<B url>, ..., callback=self.parseB)

        def parseB(self, response):
            ...
            yield Request(url=<C url>, ..., callback=self.parseC)

        def parseC(self, response):
            ...

        def parseD(self, response):
            ...

        def parseE(self, response):
            ...
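
To make the pattern concrete, here is a minimal self-contained sketch of the same chain using the current Scrapy API (scrapy.Spider and response.follow rather than the older BaseSpider). The example.com URLs and the XPath class names are invented placeholders, not from the original answer:

    import scrapy

    class ChainSpider(scrapy.Spider):
        # Hypothetical spider: the URLs and selectors below are
        # placeholders for illustration only.
        name = 'chain'
        start_urls = ['http://example.com/a']  # page A

        def parse(self, response):
            # Page A: follow the link(s) to page B.
            for href in response.xpath('//a[@class="to-b"]/@href').getall():
                yield response.follow(href, callback=self.parse_b)

        def parse_b(self, response):
            # Page B: follow every link to a C page; meta carries data
            # from this level down the chain if C needs it.
            for href in response.xpath('//a[@class="to-c"]/@href').getall():
                yield response.follow(href, callback=self.parse_c,
                                      meta={'found_on': response.url})

        def parse_c(self, response):
            # Page C: extract whatever you are actually after.
            yield {
                'url': response.url,
                'found_on': response.meta.get('found_on'),
                'title': response.xpath('//title/text()').get(),
            }

Every Request yielded from a callback goes back into Scrapy's scheduler, so requests yielded at the same level are fetched in parallel while each callback chain proceeds serially.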

Here is an example of a spider that I wrote for my project:

    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from yoMamaSpider.items import JokeItem
    from yoMamaSpider.striputils import stripcats, stripjokes
    import re

    class Jokes4UsSpider(CrawlSpider):
        name = 'jokes4us'
        allowed_domains = ['jokes4us.com']
        start_urls = ["http://www.jokes4us.com/yomamajokes/"]

        def parse(self, response):
            # Collect every link on the page and keep only the ones
            # whose URL matches the pattern of a relevant page.
            hxs = HtmlXPathSelector(response)
            links = hxs.select('//a')
            for link in links:
                url = ''.join(link.select('./@href').extract())
                relevant_urls = re.compile(
                    r'http://www\.jokes4us\.com/yomamajokes/yomamas([a-zA-Z]+)')
                if relevant_urls.match(url):
                    yield Request(url, callback=self.parse_page)

        def parse_page(self, response):
            # Scrape the actual jokes from each matched page.
            hxs = HtmlXPathSelector(response)
            categories = stripcats(hxs.select('//title/text()').extract())
            joke_area = hxs.select('//p/text()').extract()
            for joke in joke_area:
                joke = stripjokes(joke)
                if len(joke) > 15:
                    yield JokeItem(joke=joke, categories=categories)

I think the parse method is what you need: it looks at every link on the start_urls page, then uses a regular expression to decide whether it is a relevant URL (i.e. a URL I would like to scrape). If it matches, it crawls the page with yield Request(url, callback=self.parse_page), which invokes the parse_page method. (A LinkExtractor variant of the same filtering is sketched below.)

Is that what you need?
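
As an aside: newer Scrapy releases ship a built-in LinkExtractor that does this kind of regex filtering declaratively. Here is a hedged sketch of that variant, assuming a current Scrapy version; the allow= pattern just restates the relevant_urls regex above, and the class and spider names are placeholders:

    from scrapy.http import Request
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Spider

    class Jokes4UsLinkSpider(Spider):
        # Hypothetical rewrite of the spider above; only the
        # link-filtering part differs.
        name = 'jokes4us_links'
        allowed_domains = ['jokes4us.com']
        start_urls = ['http://www.jokes4us.com/yomamajokes/']

        link_extractor = LinkExtractor(
            allow=r'http://www\.jokes4us\.com/yomamajokes/yomamas[a-zA-Z]+')

        def parse(self, response):
            # extract_links returns Link objects whose .url already
            # matched the allow= regex, so no manual re.match is needed.
            for link in self.link_extractor.extract_links(response):
                yield Request(link.url, callback=self.parse_page)

        def parse_page(self, response):
            # Same extraction logic as parse_page in the spider above.
            ...

(In the old Scrapy version the original spider was written for, the equivalent extractor lived under scrapy.contrib.linkextractors.)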

