Scrapy: Why are the extracted strings in this format?

I do

item['desc'] = site.select('a/text()').extract() 

but it will be printed as follows

 [u'\n A mano libera\n '] 

What should I do to trim it and remove the strange characters such as [u'\n, the trailing spaces and the closing ]?

I can't strip it:

 exceptions.AttributeError: 'list' object has no attribute 'strip' 

and if I convert it to a string and then strip it, the result is still the string above, which I suppose is in UTF-8.

3 answers

An HTML page can contain a lot of whitespace characters like these.

What you are extracting is a list of unicode strings, so you cannot just call strip on it. If you want to remove the whitespace from each string in this list, you can run the following:

    >>> [s.strip() for s in [u'\n A mano libera\n ']]
    [u'A mano libera']

If you only need the first element, then simply:

    >>> [u'\n A mano libera\n '][0].strip()
    u'A mano libera'
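
As a minimal sketch of the same idea, the helper below strips every string in an extracted list and drops the ones that end up empty. The name `clean_strings` is made up for illustration and is not part of Scrapy, and the snippet is written for Python 3. It also shows why converting the list to a string first does not help: `str()` on a list returns its repr, with the brackets and escape sequences included.

```python
def clean_strings(values):
    """Strip surrounding whitespace from each string; skip empty results."""
    stripped = (value.strip() for value in values)
    return [value for value in stripped if value]


# Shaped like the output of .extract(); the second entry is whitespace only.
extracted = [u'\n A mano libera\n ', u'   ']

cleaned = clean_strings(extracted)
first = cleaned[0] if cleaned else None  # guard against an empty result list

# str() on a list yields its repr, so strip() finds no whitespace at the
# ends: the brackets and the escaped \n stay in the text.
as_text = str(extracted[:1]).strip()

print(cleaned)
print(first)
print(as_text)
```

Checking for an empty list before indexing matters in a spider, because an XPath that matches nothing makes `extract()` return `[]` and `[0]` would raise an IndexError.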

A good solution for this is Item Loaders. Item Loaders are objects that receive data from responses, process it, and build items for you. Here is an example of an item loader that strips the strings and returns the first value that matches the XPath, if any:

    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import MapCompose, TakeFirst

    class MyItemLoader(XPathItemLoader):
        default_item_class = MyItem
        default_input_processor = MapCompose(lambda string: string.strip())
        default_output_processor = TakeFirst()

And you use it as follows:

    def parse(self, response):
        loader = MyItemLoader(response=response)
        loader.add_xpath('desc', 'a/text()')
        return loader.load_item()
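
To make it clearer what the two processors do to the data, here is a rough pure-Python approximation of their behaviour. It is not Scrapy's code: `map_compose` and `take_first` are hypothetical stand-ins written for this illustration (Python 3), assuming that MapCompose applies each function to every value and discards None results, and that TakeFirst returns the first value that is neither None nor an empty string.

```python
def map_compose(*functions):
    """Rough stand-in for MapCompose: apply each function to every value."""
    def process(values):
        for function in functions:
            next_values = []
            for value in values:
                result = function(value)
                if result is None:
                    continue  # None results are dropped
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)  # flatten list results
                else:
                    next_values.append(result)
            values = next_values
        return values
    return process


def take_first(values):
    """Rough stand-in for TakeFirst: first value that is not None or ''."""
    for value in values:
        if value is not None and value != '':
            return value
    return None


input_processor = map_compose(str.strip)
collected = input_processor(['\n A mano libera\n ', '  '])
desc = take_first(collected)

print(collected)
print(desc)
```

The point of the design is that the cleaning rules live in the loader, so every field loaded through it is stripped the same way and the spider's parse method stays short.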

    desc = site.select('a/text()').extract()
    desc = [s.strip() for s in desc]
    item['desc'] = desc[0]