Using Scrapy ItemLoader in a Loop

I want to use Scrapy on the DMOZ website that the Scrapy tutorials use, but instead of just reading the books at the books URL (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) using Item/Field pairs, I want to create an ItemLoader that reads in the desired values (name, title, description).

This is my items.py file:

    from scrapy.item import Item, Field
    from scrapy.contrib.loader import ItemLoader
    from scrapy.contrib.loader.processor import Identity


    class DmozItem(Item):
        title = Field(
            output_processor=Identity()
        )
        link = Field(
            output_processor=Identity()
        )
        desc = Field(
            output_processor=Identity()
        )


    class MainItemLoader(ItemLoader):
        default_item_class = DmozItem
        default_output_processor = Identity()

And my spider file:

    import scrapy
    from scrapy.spiders import Spider
    from scrapy.loader import ItemLoader
    from tutorial.items import MainItemLoader, DmozItem
    from scrapy.selector import Selector


    class DmozSpider(Spider):
        name = 'dmoz'
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
        ]

        def parse(self, response):
            for sel in response.xpath('//div[@class="site-item "]/div[@class="title-and-desc"]'):
                l = MainItemLoader(response=response)
                l.add_xpath('title', '/a/div[@class="site-title"]/text()')
                l.add_xpath('link', '/a/@href')
                l.add_xpath('desc', '/div[@class="site-descr "]/text()')
                yield l.load_item()

I have tried several alternatives. I suspect the main problem is the response=response part of the ItemLoader declaration, but I can't make heads or tails of the documentation on this. Would the selector="blah" syntax help here, and if so, where should I look?

If I run this, I get a list of 22 empty brackets (the correct number of books). If I change the leading slash in each add_xpath line to a double slash, I get 22 identical lists, each containing ALL of the data (not surprisingly).

How can I write this so that the ItemLoader creates a new list containing the required fields for each individual book?

Thanks!

1 answer

You need your ItemLoader to work on a specific selector, not on the whole response:

    l = MainItemLoader(selector=sel)
    l.add_xpath('title', './a/div[@class="site-title"]/text()')
    l.add_xpath('link', './a/@href')
    l.add_xpath('desc', './div[@class="site-descr "]/text()')
    yield l.load_item()

Also note the dots at the beginning of the XPath expressions: they make each expression relative to the current selector instead of the document root.
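To see why scoping matters without hitting the network, here is a rough sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. The PAGE markup, the example.com links, and the names global_items and scoped_items are made up for illustration, and ElementTree only supports a subset of XPath (no text() or @href steps), so treat it as an analogy rather than as Scrapy's behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the listing page (not the real DMOZ markup).
PAGE = """
<html><body>
  <div class="site-item ">
    <div class="title-and-desc">
      <a href="http://example.com/book-one"><div class="site-title">Book One</div></a>
      <div class="site-descr ">First description</div>
    </div>
  </div>
  <div class="site-item ">
    <div class="title-and-desc">
      <a href="http://example.com/book-two"><div class="site-title">Book Two</div></a>
      <div class="site-descr ">Second description</div>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)
rows = root.findall('.//div[@class="site-item "]/div[@class="title-and-desc"]')

# Searching from the document root inside the loop: every iteration sees all books,
# which mirrors the "identical lists containing all data" symptom.
global_items = [
    [d.text for d in root.findall('.//a/div[@class="site-title"]')]
    for _ in rows
]

# Searching from each row with a leading "./": one book per iteration.
scoped_items = [
    {
        "title": row.find('./a/div[@class="site-title"]').text,
        "link": row.find('./a').get("href"),
        "desc": row.find('./div[@class="site-descr "]').text,
    }
    for row in rows
]

print(global_items)
print(scoped_items)
```

The first list repeats every title for each row, while the second holds one record per book, which is the effect of passing selector=sel and using ./-prefixed expressions in the loader.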
