I want to use Scrapy on the DMOZ site that the official tutorial uses ( http://www.dmoz.org/Computers/Programming/Languages/Python/Books/ ), but instead of populating the items with plain Item/Field pairs, I want to create an ItemLoader that reads the required values (title, link, description) for each book.
This is my items.py file:
```python
from scrapy.item import Item, Field
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Identity


class DmozItem(Item):
    title = Field(output_processor=Identity())
    link = Field(output_processor=Identity())
    desc = Field(output_processor=Identity())


class MainItemLoader(ItemLoader):
    default_item_class = DmozItem
    default_output_processor = Identity()
```
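As I understand the processor docs, Identity() just passes the collected values through unchanged, so each field should come out as a list of strings. This is a quick check I did, with TakeFirst shown only for comparison (I'm not using it in my code):

```python
from scrapy.contrib.loader.processor import Identity, TakeFirst

# Identity() returns whatever values were collected, unchanged,
# so each field loads as a list of strings.
Identity()([u'Example Book'])   # -> [u'Example Book']

# TakeFirst() would instead reduce the list to its first
# non-empty value, giving a single string per field.
TakeFirst()([u'Example Book'])  # -> u'Example Book'
```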
And my spider file:
```python
import scrapy
from scrapy.spiders import Spider
from scrapy.loader import ItemLoader
from tutorial.items import MainItemLoader, DmozItem
from scrapy.selector import Selector


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="site-item "]/div[@class="title-and-desc"]'):
            l = MainItemLoader(response=response)
            l.add_xpath('title', '/a/div[@class="site-title"]/text()')
            l.add_xpath('link', '/a/@href')
            l.add_xpath('desc', '/div[@class="site-descr "]/text()')
            yield l.load_item()
```
I have tried several different alternatives. I suspect the main problem is the response=response part of the ItemLoader declaration, but I can't make heads or tails of the documentation on this point. Could the selector="blah" syntax be used instead, and if so, where should I look? Something like the sketch below is what I have in mind.
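This is only a guess on my part, assuming ItemLoader accepts a selector keyword and that the XPaths then have to be made relative with a leading ./ :

```python
def parse(self, response):
    # One loader per book, scoped to that book's selector, so the
    # relative XPaths only see a single title-and-desc block.
    for sel in response.xpath('//div[@class="site-item "]/div[@class="title-and-desc"]'):
        l = MainItemLoader(selector=sel)
        l.add_xpath('title', './a/div[@class="site-title"]/text()')
        l.add_xpath('link', './a/@href')
        l.add_xpath('desc', './div[@class="site-descr "]/text()')
        yield l.load_item()
```

But I don't know whether that is the intended use of the selector argument.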
If I run the spider as posted above, I get 22 empty items ({}), which is at least the correct number of books. If I change the first slash of each XPath in the add_xpath lines to a double slash, I get 22 identical items containing ALL of the data (not surprisingly, since // matches from the document root rather than relative to each book).
How can I write this so that the ItemLoader produces a separate item containing the right fields for each individual book?
Thanks!