I am writing my first spider in Scrapy and am trying to follow the documentation. I have implemented ItemLoaders. The spider retrieves the data, but the extracted values contain many line breaks (\r\n). I have tried several ways to remove them, but nothing works. The replace_escape_chars utility should do the job, but I cannot figure out how to use it with an ItemLoader. Some people also use unicode.strip, but I can't get that to work either. Some people apply these in items.py and others in the spider. How can I strip these line breaks (\r\n) from the data? My items.py file contains only the item names, each defined as Field(). Spider code below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.utils.markup import replace_escape_chars
from ccpstore.items import Greenhouse


class GreenhouseSpider(BaseSpider):
    name = "greenhouse"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://www.domain.com",
    ]

    def parse(self, response):
        items = []
        l = XPathItemLoader(item=Greenhouse(), response=response)
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('title', '//h1')
        l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]')
        l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]')
        l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]')
        items.append(l.load_item())
        return items
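For reference, here is a minimal sketch of the kind of cleaning step I am after, written in plain Python so it runs without Scrapy. The helper name clean_text and the sample string are hypothetical, not taken from my project or from Scrapy. Because the XPath expressions above select whole elements rather than text nodes, the sketch also strips tags. The commented lines show how I assume such a function could be attached to a loader through MapCompose in this Scrapy version; that part is untested.

```python
import re


def clean_text(value):
    """Hypothetical helper: strip tags, drop \\r \\n \\t, collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", value)        # replace each HTML tag with a space
    no_escapes = re.sub(r"[\r\n\t]", " ", no_tags)  # replace escape characters with a space
    return re.sub(r"\s+", " ", no_escapes).strip()  # collapse runs of whitespace and trim


# Made-up sample resembling what an element-level XPath might return.
raw = '<li id="x">\r\n   Apply weekly\r\n</li>'
print(clean_text(raw))

# Assumed wiring for the Scrapy version used above (not run here;
# GreenhouseLoader is a hypothetical subclass name):
#   from scrapy.contrib.loader.processor import MapCompose, TakeFirst
#
#   class GreenhouseLoader(XPathItemLoader):
#       default_input_processor = MapCompose(clean_text)
#       default_output_processor = TakeFirst()
```

With a loader like this, every value extracted by add_xpath would pass through clean_text before being stored, so the cleaning would live in one place instead of being repeated per field.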