How to get innerHTML from node using Scrapy selector?

Question

Suppose there are some html snippets, such as:

    <a>
        text in a
        <b>text in b</b>
        <c>text in c</c>
    </a>
    <a>
        <b>text in b</b>
        text in a
        <c>text in c</c>
    </a>

I want to extract the text inside each a tag, dropping the nested tags but preserving their text. For the snippet above, the content I want is "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes using the Scrapy Selector css() function, but how do I then process these nodes to get what I want? Any ideas would be appreciated, thanks!

+5

python html css-selectors xpath scrapy

kuixiong Feb 22 '15 at 12:58

2 answers

You can use the XPath string() function for the elements you select:

    $ python
    >>> import scrapy
    >>> selector = scrapy.Selector(text="""<a>
    ...     text in a
    ...     <b>text in b</b>
    ...     <c>text in c</c>
    ... </a>
    ... <a>
    ...     <b>text in b</b>
    ...     text in a
    ...     <c>text in c</c>
    ... </a>""", type="html")
    >>> for link in selector.css('a'):
    ...     print(link.xpath('string(.)').extract())
    ...
    [u'\n    text in a\n    text in b\n    text in c\n']
    [u'\n    text in b\n    text in a\n    text in c\n']
    >>>

+4

paul trmbrth Feb 23 '15 at 10:47

Golfwolf · Accepted Answer · 2015-02-22T13:48:28+0000

Here is what I managed to do:

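    from scrapy.selector import Selector

    # html_string holds the HTML snippet from the question
    sel = Selector(text=html_string)

    # print every text node found under the a elements
    for node in sel.css('a *::text'):
        print(node.extract())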

Assuming html_string is a variable containing the HTML from your question, this code produces the following output:

  text in a text in b text in c text in b text in a text in c

The a *::text selector matches all text nodes that are descendants of an a element.
