How to remove whitespace in Scrapy spider data

I am writing my first spider in Scrapy and am trying to follow the documentation. I have implemented ItemLoaders. The spider retrieves the data, but the data contains many newlines. I have tried many ways to remove them, but nothing works. The replace_escape_chars utility should work, but I cannot figure out how to use it with an ItemLoader. Some people use (unicode.strip), but again, I can't get that to work. Some people try to apply these in items.py and others in the spider. How can I clean the data of these newlines (\r\n)? My items.py file contains only the item names and Field(). Spider code below:

 from scrapy.spider import BaseSpider
 from scrapy.selector import HtmlXPathSelector
 from scrapy.contrib.loader import XPathItemLoader
 from scrapy.utils.markup import replace_escape_chars
 from ccpstore.items import Greenhouse

 class GreenhouseSpider(BaseSpider):
     name = "greenhouse"
     allowed_domains = ["domain.com"]
     start_urls = [
         "http://www.domain.com",
     ]

     def parse(self, response):
         items = []
         l = XPathItemLoader(item=Greenhouse(), response=response)
         l.add_xpath('name', '//div[@class="product_name"]')
         l.add_xpath('title', '//h1')
         l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]')
         l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]')
         l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]')
         items.append(l.load_item())
         return items
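For reference, here is what I understand replace_escape_chars to do when run on its own, outside any spider (a minimal check; the sample string is invented, and I'm importing it from w3lib.html, which scrapy.utils.markup re-exports):

 from w3lib.html import replace_escape_chars

 # replace_escape_chars drops \n, \t and \r by default; it does not
 # touch ordinary spaces, so leading/trailing spaces survive.
 raw = u'\r\n    Product name\r\n'
 print(repr(replace_escape_chars(raw)))  # u'    Product name'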
2 answers

It turns out that there was also a lot of extra whitespace in the data, so combining Stephen's answer with some further research allowed the data to be cleared of all tags, newlines, and duplicate spaces. The working code is below. Note the addition of text() in the loader lines, which removes the tags, and the split and join processors, which remove the whitespace and newlines.

 def parse(self, response):
     items = []
     l = XPathItemLoader(item=Greenhouse(), response=response)
     l.default_input_processor = MapCompose(lambda v: v.split(), replace_escape_chars)
     l.default_output_processor = Join()
     l.add_xpath('title', '//h1/text()')
     l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]/text()')
     l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]/text()')
     l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]/text()')
     items.append(l.load_item())
     return items
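To see why the split-and-join combination removes the duplicate spaces, the processor chain can be run by hand on a raw value (a small illustration outside the spider, using the same import paths as the answer below; the sample string is invented):

 from scrapy.contrib.loader.processor import MapCompose, Join
 from w3lib.html import replace_escape_chars

 raw = [u'\r\n  Apply   weekly \r\n  during summer  ']
 # MapCompose flattens the lists returned by split(), so every run of
 # whitespace (including \r\n) collapses into separate clean tokens...
 tokens = MapCompose(lambda v: v.split(), replace_escape_chars)(raw)
 # ...and Join() glues the tokens back together with single spaces.
 print(Join()(tokens))  # u'Apply weekly during summer'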

You can set a default_output_processor on the loader, and also attach other processors to individual fields; see the title field:

 from scrapy.spider import BaseSpider
 from scrapy.contrib.loader import XPathItemLoader
 from scrapy.contrib.loader.processor import Compose, MapCompose
 from w3lib.html import replace_escape_chars, remove_tags
 from ccpstore.items import Greenhouse

 class GreenhouseSpider(BaseSpider):
     name = "greenhouse"
     allowed_domains = ["domain.com"]
     start_urls = ["http://www.domain.com"]

     def parse(self, response):
         l = XPathItemLoader(Greenhouse(), response=response)
         l.default_output_processor = MapCompose(lambda v: v.strip(), replace_escape_chars)
         l.add_xpath('name', '//div[@class="product_name"]')
         l.add_xpath('title', '//h1', Compose(remove_tags))
         l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]')
         l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]')
         l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]')
         return l.load_item()
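As a quick check of the per-field override, remove_tags can be run directly on a markup string to see what the title field receives (a minimal sketch with an invented string):

 from w3lib.html import remove_tags

 # remove_tags strips the markup but keeps the text content, which is
 # why title can be loaded from '//h1' without a /text() step in the
 # XPath expression.
 print(remove_tags(u'<h1><span>Greenhouse</span> Mix</h1>'))  # Greenhouse Mix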
