How to return HTML result with HtmlXPathSelector (Scrapy)

How to get all the HTML contained inside a tag?

 hxs = HtmlXPathSelector(response)
 element = hxs.select('//span[@class="title"]/')

Maybe something like:

 hxs.select('//span[@class="title"]/html()') 

EDIT: Looking at the documentation, I only see methods that return a new XPathSelectorList or the raw text inside the tag. I don't want a new list or just the text; I want the HTML source code inside the tag, e.g.:

 <html>
   <head>
     <title></title>
   </head>
   <body>
     <div id="leexample">
       justtext
       <p class="ihatelookingforfeatures"> sometext </p>
       <p class="yahc"> sometext </p>
     </div>
     <div id="lenot"> blabla </div>
     an awfuly long example for this.
   </body>
 </html>

I want something like hxs.select('//div[@id="leexample"]/html()') that returns the HTML source inside that div, like so:

 justtext
 <p class="ihatelookingforfeatures"> sometext </p>
 <p class="yahc"> sometext </p>

I hope this clears up the ambiguity around my question.

How do I get the HTML out of an HtmlXPathSelector in Scrapy? (Or maybe the solution goes beyond HtmlXPathSelector?)

+4
6 answers

Call .extract() on the XPathSelectorList. It should return a list of Unicode strings containing the HTML content you want.

 hxs.select('//div[@id="leexample"]/*').extract() 

Update

 # This is wrong
 hxs.select('//div[@id="leexample"]/html()').extract()

/html() is not a valid selector. To extract all the children, use '//div[@id="leexample"]/*' or '//div[@id="leexample"]/node()'. Note that node() also returns text nodes, so the result looks something like:

 [u'\n',
  u'<a href="image1.html">Name: My image 1\n']
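
If the goal is the inner HTML of the div as a single string, a minimal sketch (my own illustration, assuming the question's sample page is loaded into hxs) is simply to join the extracted node() parts:

 from scrapy.selector import HtmlXPathSelector

 hxs = HtmlXPathSelector(response)
 # Join every child node (text and elements) of the div back into one string.
 inner_html = ''.join(hxs.select('//div[@id="leexample"]/node()').extract())
 # inner_html -> ' justtext <p class="ihatelookingforfeatures"> sometext </p> <p class="yahc"> sometext </p> '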
+5

Using

 //span[@class="title"]/node() 

this selects all nodes (elements, text nodes, processing instructions, and comments) that are children of any span element in the document whose class attribute is set to "title".

If you want only the child nodes of the first such span in the document, use:

 (//span[@class="title"])[1]/node() 
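
For completeness, here is a small sketch of how these XPath expressions might be used from a Scrapy selector (this usage is my illustration, not part of the original answer, and assumes the page is loaded into hxs):

 # Child nodes of every matching span:
 parts = hxs.select('//span[@class="title"]/node()').extract()

 # Child nodes of only the first matching span, joined into one inner-HTML string:
 inner_html = ''.join(hxs.select('(//span[@class="title"])[1]/node()').extract())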
+3

Although I'm late, I'll leave this here for the record.

What I do:

 html = ''.join(hxs.select('//span[@class="title"]/node()').extract()) 

Or, if we want to handle several matching nodes:

 elements = hxs.select('//span[@class="title"]')
 html = [''.join(e.select('./node()').extract()) for e in elements]
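
Applied to the sample HTML from the question (using its div id instead of the span selector; this adaptation is mine), the same pattern would give roughly:

 elements = hxs.select('//div[@id="leexample"]')
 html = [''.join(e.select('./node()').extract()) for e in elements]
 # html[0] ~ ' justtext <p class="ihatelookingforfeatures"> sometext </p> <p class="yahc"> sometext </p> '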
+1

Similar to what @xiaowl pointed out: using hxs.select('//div[@id="leexample"]').extract() you get the whole HTML of the element matched by the XPath //div[@id="leexample"], including the tag itself.

So, for the record, here is what I ended up with:

 post = postItem()  # body = Field, declared in item.py
 post['body'] = hxs.select('//span[@id="edit' + self.postid + '"]').extract()
 open('logs/test.log', 'wb').write(str(post['body']))
 # logs/test.log contains all the HTML inside the tag selected by the query.
0

In fact, it is not as difficult as it seems. Just remove the trailing / from your XPath expression and use the extract() method. Here is an abridged example from a Scrapy shell session:

 sjaak:~ sjaakt$ scrapy shell
 2012-07-19 11:06:21+0200 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
 >>> fetch('http://www.nu.nl')
 2012-07-19 11:06:34+0200 [default] INFO: Spider opened
 2012-07-19 11:06:34+0200 [default] DEBUG: Crawled (200) <GET http://www.nu.nl> (referer: None)
 >>> hxs.select("//h1").extract()
 [u'<h1> <script type="text/javascript">document.write(NU.today())</script>.\n Het laatste nieuws het eerst op NU.nl </h1>\n ']
 >>>

To get only the inner content of a tag, add /* to your XPath expression. Example:

 >>> hxs.select("//h1/*").extract()
 [u'<script type="text/javascript">document.write(NU.today())</script>.\n Het laatste nieuws het eerst op NU.nl ']
0

A bit of a hack (it reaches into the private _root property of Selector; works in Scrapy 1.0.5):

 from lxml import html

 def extract_inner_html(sel):
     # Leading text of the element plus the serialized markup of each direct child
     # (lxml's tostring() includes each child's tail text, so nothing is lost).
     return (sel._root.text or '') + ''.join(
         html.tostring(child) for child in sel._root.iterchildren())

 def extract_inner_text(sel):
     # Concatenate all descendant text nodes and strip surrounding whitespace.
     return ''.join(sel.css('::text').extract()).strip()

Use it as:

 reason = extract_inner_html(statement.css(".politic-rating .rate-reason")[0])
 text = extract_inner_text(statement.css('.politic-statement')[0])
 all_text = extract_inner_text(statement.css('.politic-statement'))

I found a piece of lxml code in this question.
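
As a side note that goes beyond the original answer (so treat it as an assumption): newer, parsel-based Scrapy selectors expose the underlying lxml element through a public root attribute, so a similar helper could avoid the private property:

 from lxml import html

 def extract_inner_html(sel):
     # Same idea, but via the public .root attribute of parsel-based selectors.
     root = sel.root
     return (root.text or '') + ''.join(
         html.tostring(child, encoding='unicode') for child in root.iterchildren())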

0
