How to return HTML result with HtmlXPathSelector (Scrapy)

How to get all the HTML contained inside a tag?

 hxs = HtmlXPathSelector(response)
 element = hxs.select('//span[@class="title"]/')

Maybe something like:

 hxs.select('//span[@class="title"]/html()') 

EDIT: Looking at the documentation, I only see methods that return a new XPathSelectorList or the raw text inside the tag. I don't want a new list or just the text; I want the HTML source code inside the tag, e.g.:

 <html>
   <head>
     <title></title>
   </head>
   <body>
     <div id="leexample">
       justtext
       <p class="ihatelookingforfeatures"> sometext </p>
       <p class="yahc"> sometext </p>
     </div>
     <div id="lenot"> blabla </div>
     an awfuly long example for this.
   </body>
 </html>

I want something like hxs.select('//div[@id="leexample"]/html()') that returns the HTML source inside that div, like so:

 justtext
 <p class="ihatelookingforfeatures"> sometext </p>
 <p class="yahc"> sometext </p>

I hope this clears up the ambiguity around my question.

How do I get the HTML out of an HtmlXPathSelector in Scrapy? (Or maybe the solution goes beyond HtmlXPathSelector?)

+4
6 answers

Call .extract() on the XPathSelectorList. It should return a list of Unicode strings containing the HTML content you want.

 hxs.select('//div[@id="leexample"]/*').extract() 

Update

 # This is wrong
 hxs.select('//div[@id="leexample"]/html()').extract()

/html() is not a valid selector. To extract all the children, use '//div[@id="leexample"]/*' or '//div[@id="leexample"]/node()'. Note that node() also returns text nodes, so the result looks something like:

 [u'\n',
  u'<a href="image1.html">Name: My image 1\n']
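
If the goal is the inner HTML of the div as a single string, a minimal sketch (my own illustration, assuming the question's sample page is loaded into hxs) is simply to join the extracted node() parts:

 from scrapy.selector import HtmlXPathSelector

 hxs = HtmlXPathSelector(response)
 # Join every child node (text and elements) of the div back into one string.
 inner_html = ''.join(hxs.select('//div[@id="leexample"]/node()').extract())
 # inner_html -> ' justtext <p class="ihatelookingforfeatures"> sometext </p> <p class="yahc"> sometext </p> '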
+5

Using

 //span[@class="title"]/node() 

this selects all nodes (elements, text nodes, processing instructions, and comments) that are children of any span element in the document whose class attribute is set to "title".

If you want only the child nodes of the first such span in the document, use:

 (//span[@class="title"])[1]/node() 
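
For completeness, here is a small sketch of how these XPath expressions might be used from a Scrapy selector (this usage is my illustration, not part of the original answer, and assumes the page is loaded into hxs):

 # Child nodes of every matching span:
 parts = hxs.select('//span[@class="title"]/node()').extract()

 # Child nodes of only the first matching span, joined into one inner-HTML string:
 inner_html = ''.join(hxs.select('(//span[@class="title"])[1]/node()').extract())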
+3

Although I'm late, I'll leave this here for the record.

What I do:

 html = ''.join(hxs.select('//span[@class="title"]/node()').extract()) 

Or, if we want to handle several matching nodes:

 elements = hxs.select('//span[@class="title"]')
 html = [''.join(e.select('./node()').extract()) for e in elements]
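
Applied to the sample HTML from the question (using its div id instead of the span selector; this adaptation is mine), the same pattern would give roughly:

 elements = hxs.select('//div[@id="leexample"]')
 html = [''.join(e.select('./node()').extract()) for e in elements]
 # html[0] ~ ' justtext <p class="ihatelookingforfeatures"> sometext </p> <p class="yahc"> sometext </p> '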
+1

Similar to what @xiaowl pointed out: using hxs.select('//div[@id="leexample"]').extract() you get the whole HTML of the element matched by the XPath //div[@id="leexample"], including the tag itself.

So, for the record, here is what I ended up with:

 post = postItem()  # body = Field, declared in item.py
 post['body'] = hxs.select('//span[@id="edit' + self.postid + '"]').extract()
 open('logs/test.log', 'wb').write(str(post['body']))
 # logs/test.log contains all the HTML inside the tag selected by the query.
0

In fact, it is not as difficult as it seems. Just remove the trailing / from your XPath expression and use the extract() method. Here is an abridged example from a Scrapy shell session:

 sjaak:~ sjaakt$ scrapy shell
 2012-07-19 11:06:21+0200 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
 >>> fetch('http://www.nu.nl')
 2012-07-19 11:06:34+0200 [default] INFO: Spider opened
 2012-07-19 11:06:34+0200 [default] DEBUG: Crawled (200) <GET http://www.nu.nl> (referer: None)
 >>> hxs.select("//h1").extract()
 [u'<h1> <script type="text/javascript">document.write(NU.today())</script>.\n Het laatste nieuws het eerst op NU.nl </h1>\n ']
 >>>

To get only the inner content of a tag, add /* to your XPath expression. Example:

 >>> hxs.select("//h1/*").extract()
 [u'<script type="text/javascript">document.write(NU.today())</script>.\n Het laatste nieuws het eerst op NU.nl ']
0

A bit of a hack (it reaches into the private _root property of Selector; works in Scrapy 1.0.5):

 from lxml import html

 def extract_inner_html(sel):
     # Leading text of the element plus the serialized markup of each direct child
     # (lxml's tostring() includes each child's tail text, so nothing is lost).
     return (sel._root.text or '') + ''.join(
         html.tostring(child) for child in sel._root.iterchildren())

 def extract_inner_text(sel):
     # Concatenate all descendant text nodes and strip surrounding whitespace.
     return ''.join(sel.css('::text').extract()).strip()

Use it as:

 reason = extract_inner_html(statement.css(".politic-rating .rate-reason")[0])
 text = extract_inner_text(statement.css('.politic-statement')[0])
 all_text = extract_inner_text(statement.css('.politic-statement'))

I found a piece of lxml code in this question.
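
As a side note that goes beyond the original answer (so treat it as an assumption): newer, parsel-based Scrapy selectors expose the underlying lxml element through a public root attribute, so a similar helper could avoid the private property:

 from lxml import html

 def extract_inner_html(sel):
     # Same idea, but via the public .root attribute of parsel-based selectors.
     root = sel.root
     return (root.text or '') + ''.join(
         html.tostring(child, encoding='unicode') for child in root.iterchildren())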

0
