How to get innerHTML from node using Scrapy selector?

Suppose there are some html snippets, such as:

    <a>
     text in a
     <b>text in b</b>
     <c>text in c</c>
    </a>
    <a>
     <b>text in b</b>
     text in a
     <c>text in c</c>
    </a>

I want to extract the text inside each a tag, dropping the child tags but preserving their text. For example, the content I want to extract from the snippet above would be "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes using the Scrapy Selector css() function, but how do I process these nodes to get what I want? Any ideas would be appreciated, thanks!

+5
2 answers

Here is what I managed to do:

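    from scrapy.selector import Selector

    # html_string holds the HTML from the question
    sel = Selector(text=html_string)
    # 'a *::text' matches every text node below an <a> element
    for node in sel.css('a *::text'):
        print(node.extract())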

Assuming html_string is a variable containing the HTML from your question, this code produces the following output:

  text in a text in b text in c text in b text in a text in c 

The a *::text selector matches all text nodes that are descendants of an a element.

+5

You can use the XPath string() function for the elements you select:

    $ python
    >>> import scrapy
    >>> selector = scrapy.Selector(text="""<a>
    ...  text in a
    ...  <b>text in b</b>
    ...  <c>text in c</c>
    ... </a>
    ... <a>
    ...  <b>text in b</b>
    ...  text in a
    ...  <c>text in c</c>
    ... </a>""", type="html")
    >>> for link in selector.css('a'):
    ...     print link.xpath('string(.)').extract()
    ...
    [u'\n text in a\n text in b\n text in c\n']
    [u'\n text in b\n text in a\n text in c\n']
    >>>
+4

Source: https://habr.com/ru/post/1213906/

