R: extract rvest innerHTML

Using rvest in R to scrape a web page, I would like to extract the equivalent of innerHTML from a node, in particular to convert line breaks into newlines before applying html_text.

An example of the desired functionality:

 library(rvest)
 doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
 innerHTML(doc, ".pp")

This should print the following output:

 [1] "<p class=\"pp\">First Line<br>Second Line</p>" 

With rvest 0.2 this can be achieved using toString.XMLNode:

 # run under rvest 0.2
 library(XML)
 html('<html><p class="pp">First Line<br />Second Line</p>') %>%
   html_node(".pp") %>%
   toString.XMLNode
 [1] "<p class=\"pp\">First Line<br>Second Line</p>"

With the new rvest 0.2.0.900 this no longer works.

 # run under rvest 0.2.0.900
 library(XML)
 html_node(doc, ".pp") %>%
   toString.XMLNode
 [1] "{xml_node}\n<p>\n[1] <br/>"

The desired functionality is generally available in the write_xml function of the xml2 package, on which rvest now depends, if only write_xml could return its output to a variable instead of insisting on writing to a file (a textConnection is not accepted either).

As a workaround, I can write to a temporary file and read it back:

 # extract innerHTML, workaround: write/read to/from temp file
 html_innerHTML <- function(x, css, xpath) {
   file <- tempfile()
   html_node(x, css) %>% write_xml(file)
   txt <- readLines(file, warn = FALSE)
   unlink(file)
   txt
 }
 html_innerHTML(doc, ".pp")
 [1] "<p class=\"pp\">First Line<br>Second Line</p>"

With this, I can then, for example, convert line-break tags to newline characters:

 html_innerHTML(doc, ".pp") %>%
   gsub("<br\\s*/?\\s*>", "\n", .) %>%
   read_html %>%
   html_text
 [1] "First Line\nSecond Line"

Is there a better way to do this using existing functions, for example from rvest, xml2, XML, or other packages? In particular, I would like to avoid writing to the hard drive.

1 answer

As @r2evans noted, as.character(doc) is the solution.

Regarding the last piece of code, which extracts <br>-separated text from a node while converting each <br> to a newline, there is a workaround in the unresolved rvest issue #175, comment #2:

A simplified version for this problem:

 doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

 # r2evans solution:
 as.character(rvest::html_node(doc, xpath="//p"))
 ## [1] "<p class=\"pp\">First Line<br>Second Line</p>"

 # rentrop@github solution, simplified:
 innerHTML <- function(x, trim = FALSE, collapse = "\n"){
   paste(xml2::xml_find_all(x, ".//text()"), collapse = collapse)
 }
 innerHTML(doc)
 ## [1] "First Line\nSecond Line"
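If the regex-based conversion from the question is preferred, the two ideas can probably be combined: as.character() serializes the node in memory, so the temporary file is no longer needed. The following is only a sketch under that assumption; the helper name br_to_newline_text is made up for illustration and is not part of rvest or xml2.

```r
library(rvest)

# Hypothetical helper (not an rvest/xml2 function): returns the text of the
# first node matching `css`, with <br> tags turned into newline characters.
br_to_newline_text <- function(doc, css) {
  node <- html_node(doc, css)
  # Serialize the node to an HTML string in memory (no temp file).
  html <- as.character(node)
  # Same regex as in the question: replace <br>, <br/>, <br /> with "\n".
  html <- gsub("<br\\s*/?\\s*>", "\n", html)
  # Re-parse the modified markup and extract its text.
  html_text(xml2::read_html(html))
}

doc <- xml2::read_html('<html><p class="pp">First Line<br />Second Line</p>')
result <- br_to_newline_text(doc, ".pp")
print(result)
```

Compared with the text() approach above, this keeps newlines only where a <br> actually was, at the cost of parsing the fragment a second time.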