Multiple Pages per Item in Scrapy

Disclaimer: I'm pretty new to Scrapy.

My question: how can I scrape an Item property from a linked page and have the result end up in the same item?

Given the following Spider example:

    from scrapy import Request, Selector, Spider

    # Place (an Item subclass) and SiteLoader (an ItemLoader subclass) are
    # defined elsewhere; self.parent_domain and self.template are set on
    # the spider elsewhere as well.

    class SiteSpider(Spider):
        site_loader = SiteLoader

        # ...

        def parse(self, response):
            item = Place()
            sel = Selector(response)
            bl = self.site_loader(item=item, selector=sel)
            bl.add_value('domain', self.parent_domain)
            bl.add_value('origin', response.url)
            for place_property in item.fields:
                parse_xpath = self.template.get(place_property)
                # parse_xpath will look like either:
                # '//path/to/property/text()'
                # or
                # {'url_elem': '//a[@id="Location"]/@href',
                #  'xpath': '//div[@class="directions"]/span[@class="address"]/text()'}
                if isinstance(parse_xpath, dict):
                    # place_property is at a URL
                    url = sel.xpath(parse_xpath['url_elem']).extract()[0]
                    yield Request(url, callback=self.get_url_property,
                                  meta={'loader': bl, 'parse_xpath': parse_xpath,
                                        'place_property': place_property})
                else:
                    # parse_xpath is just an xpath; process normally
                    bl.add_xpath(place_property, parse_xpath)
            yield bl.load_item()

        def get_url_property(self, response):
            loader = response.meta['loader']
            parse_xpath = response.meta['parse_xpath']
            place_property = response.meta['place_property']
            sel = Selector(response)
            loader.add_value(place_property, sel.xpath(parse_xpath['xpath']).extract())
            return loader

I run these spiders against several sites, and most of them have all the data I need on a single page, so everything works fine. However, some sites keep certain properties on a sub-page (for example, the "address" data lives behind the "Get directions" link).

The "Request Request" line really is where I have the problem. I see that the elements are moving along the pipeline, but they lack the properties that are on other URLs (IOW, those properties that receive a "Request Request"). The get_url_property basically searches for xpath in the new response variable and adds this to the element loader instance.

Is there a way to do what I'm looking for, or is there a better approach? I would like to avoid making a synchronous call to get the data I need (if that is even possible here), but if that is the best option, then perhaps it is the right approach. Thanks.

1 answer

If I understood correctly, you have (at least) two different cases:

  • The crawled page links to another page containing the data (one more request is required)
  • The crawled page itself contains the data (no further request is required)

In your current code, you call yield bl.load_item() for both cases, but always in the parse callback. Note that the request you yield is only executed at some later point in time; the item is therefore still incomplete when it is loaded, which is why the place_property key is missing from the item in the first case.
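To make the timing concrete, here is a minimal, self-contained sketch (spider name, URLs, and field names are placeholders, and a plain dict stands in for your Place item) showing why yielding the item inside parse cannot work for the first case:

    from scrapy import Request, Spider

    class TimingDemoSpider(Spider):
        name = 'timing_demo'
        start_urls = ['http://example.com/place']  # placeholder URL

        def parse(self, response):
            item = {'address': None}
            # Yielding only *schedules* the request; fill_address runs
            # later, after parse has already finished.
            yield Request('http://example.com/directions',  # placeholder
                          callback=self.fill_address, meta={'item': item})
            # This executes immediately, before fill_address has run, so
            # the item reaches the pipeline with 'address' still empty.
            yield item

        def fill_address(self, response):
            item = response.meta['item']
            item['address'] = response.xpath(
                '//span[@class="address"]/text()').extract()
            # The item was already yielded above; nothing yields it here,
            # which mirrors the bug in the question.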

Possible Solution

A possible solution (if I understand you correctly) is to use Scrapy's asynchronous behavior: only minor changes to your code are required.

In the first case, you pass the item loader to another request, which then yields the item. This is what you already do in the isinstance branch of your if statement. You only need to change get_url_property so that it actually yields the loaded item.

In the second case, you can return the item directly, so just yield the item in the else clause.

The following code contains the changes to your example. Does this solve your problem?

    def parse(self, response):
        # ...
        if isinstance(parse_xpath, dict):
            # place_property is at a URL
            url = sel.xpath(parse_xpath['url_elem']).extract()[0]
            yield Request(url, callback=self.get_url_property,
                          meta={'loader': bl, 'parse_xpath': parse_xpath,
                                'place_property': place_property})
        else:
            # parse_xpath is just an xpath; process normally
            bl.add_xpath(place_property, parse_xpath)
            yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        # ...
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']).extract())
        yield loader.load_item()
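One caveat with this sketch: if more than one property is of the dict form, parse will yield several requests that all share the same loader, and each callback will then yield its own partially filled copy of the item. Chaining the requests one after another avoids that; see the sketch below.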

Related to this problem is the question of chaining requests, for which I have outlined a similar solution.
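For completeness, here is a rough sketch (not part of the original answer) of what such request chaining could look like when several properties each live on their own sub-page. The helpers _next_request_or_item and _fill_property are hypothetical names, and SiteLoader, Place, self.template, and the 'url_elem'/'xpath' dict layout are assumed to work as in the question. Each callback adds its value to the loader and hands it on to the next pending request; only the last callback yields the finished item, so no partial duplicates reach the pipeline:

    from scrapy import Request, Selector, Spider

    class ChainedSiteSpider(Spider):
        name = 'chained_site'  # placeholder name

        def parse(self, response):
            item = Place()
            sel = Selector(response)
            bl = SiteLoader(item=item, selector=sel)
            pending = []  # (place_property, url, xpath) triples for sub-pages
            for place_property in item.fields:
                parse_xpath = self.template.get(place_property)
                if isinstance(parse_xpath, dict):
                    # Defer this property; it lives on a sub-page.
                    url = sel.xpath(parse_xpath['url_elem']).extract()[0]
                    pending.append((place_property, url, parse_xpath['xpath']))
                else:
                    bl.add_xpath(place_property, parse_xpath)
            yield self._next_request_or_item(bl, pending)

        def _next_request_or_item(self, loader, pending):
            # Hypothetical helper: return the next chained Request,
            # or the finished item once nothing is pending.
            if not pending:
                return loader.load_item()
            place_property, url, xpath = pending.pop()
            return Request(url, callback=self._fill_property,
                           meta={'loader': loader, 'pending': pending,
                                 'place_property': place_property,
                                 'xpath': xpath})

        def _fill_property(self, response):
            loader = response.meta['loader']
            loader.add_value(response.meta['place_property'],
                             Selector(response).xpath(response.meta['xpath']).extract())
            # Hand the loader to the next sub-page, or yield the item.
            yield self._next_request_or_item(loader, response.meta['pending'])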
