Scrapy: Why are the extracted lines in this format?

Question

I do

item['desc'] = site.select('a/text()').extract()

but it is printed as follows:

 [u'\n A mano libera\n ']

What should I do to trim it and remove the strange characters such as [u'\n, the trailing spaces and ']?

I can't strip it:

 exceptions.AttributeError: 'list' object has no attribute 'strip'

and if I convert it to a string and then strip it, the result is still the string above, which I suppose is in UTF-8.

+7

python scrapy

realtebo Jun 08 '13 at 14:44

3 answers

A good solution for this is to use Item Loaders. Item Loaders are objects that receive data from responses, process it, and build items for you. Here is an example of an Item Loader that strips the strings and returns the first value that matches the XPath, if any:

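    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import MapCompose, TakeFirst

    # MyItem is the project's own Item subclass, defined elsewhere
    class MyItemLoader(XPathItemLoader):
        default_item_class = MyItem
        # strip every extracted string on input
        default_input_processor = MapCompose(lambda string: string.strip())
        # keep only the first non-empty value on output
        default_output_processor = TakeFirst()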

And you use it as follows:

    def parse(self, response):
        loader = MyItemLoader(response=response)
        loader.add_xpath('desc', 'a/text()')
        return loader.load_item()
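To make it clearer what the two processors are doing to the extracted list, here is a rough plain-Python sketch of the same idea. The helpers `strip_each` and `take_first` are hypothetical names made up for illustration; they only imitate the behavior described above and are not Scrapy's actual implementation.

```python
def strip_each(values):
    """Imitate the input processor: call strip() on every extracted string."""
    return [value.strip() for value in values]


def take_first(values):
    """Imitate the output processor: first value that is not None or empty."""
    for value in values:
        if value is not None and value != '':
            return value
    return None


# The list from the question, as extract() returned it
extracted = ['\n A mano libera\n ']
desc = take_first(strip_each(extracted))
print(desc)
```

The point of the loader is that this cleanup is declared once on the loader class instead of being repeated in every callback.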

+8

Capi etheel Jun 10 '13 at 23:50

    desc = site.select('a/text()').extract()
    desc = [s.strip() for s in desc]
    item['desc'] = desc[0]
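One thing to watch for with the snippet above: if the XPath matches nothing, the list is empty and indexing it with [0] raises an IndexError. A small guard along these lines might help; `first_stripped` is a hypothetical helper name, not part of Scrapy.

```python
def first_stripped(values, default=None):
    """Return the first string with surrounding whitespace removed,
    or `default` when the list of extracted values is empty."""
    return values[0].strip() if values else default


found = first_stripped(['\n A mano libera\n '])
missing = first_stripped([], default='')
print(repr(found), repr(missing))
```

With this, the last line of the snippet could become item['desc'] = first_stripped(desc).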

+1

Nanhe kumar Jul 18 '16 at 11:32

icecrime · Accepted Answer · 2013-06-08T14:48:30+0000

An HTML page may very well contain these whitespace characters.

What you are extracting is a list of unicode strings, which is why you cannot simply call strip on it. If you want to strip the whitespace from each string in the list, you can run the following:

    >>> [s.strip() for s in [u'\n A mano libera\n ']]
    [u'A mano libera']

If you only need the first element, then simply:

    >>> [u'\n A mano libera\n '][0].strip()
    u'A mano libera'
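As for the "strange characters" in the question: the brackets, quotes and \n are just how Python displays a list, not part of the text itself. The sketch below, written for Python 3 (where the u prefix is no longer shown), illustrates why converting the whole list to a string and stripping that does not help, while stripping the element does.

```python
values = ['\n A mano libera\n ']

# Converting the list to a string gives its display form: brackets, quotes
# and a literal backslash followed by "n" become part of the string.
as_text = str(values)

# strip() finds nothing to remove, because that string starts with "["
# and ends with "]".
stripped_text = as_text.strip()

# Stripping the element itself removes the real newline and spaces.
cleaned = values[0].strip()

print(as_text)
print(cleaned)
```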
