Using rvest in R to scrape a web page, I would like to extract the equivalent of innerHTML from a node, in particular so that I can convert line breaks (<br> tags) to newlines before applying html_text.
An example of the desired functionality:
library(rvest)
doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
innerHTML(doc, ".pp")
This should print the following output:
[1] "<p class=\"pp\">First Line<br>Second Line</p>"
With rvest 0.2 this can be achieved using toString.XMLNode:
# run under rvest 0.2
library(XML)
html('<html><p class="pp">First Line<br />Second Line</p>') %>%
  html_node(".pp") %>%
  toString.XMLNode
[1] "<p class=\"pp\">First Line<br>Second Line</p>"
With the new rvest 0.2.0.900 this no longer works.
# run under rvest 0.2.0.900
library(XML)
html_node(doc, ".pp") %>%
  toString.XMLNode
[1] "{xml_node}\n<p>\n[1] <br/>"
The desired functionality is almost available in the write_xml function of the xml2 package, on which rvest now depends, if only write_xml could return its output as a string instead of insisting on writing to a file. (A textConnection is not accepted either.)
As a workaround, I can write to a temporary file:
# extract innerHTML, workaround: write/read to/from temp file
html_innerHTML <- function(x, css, xpath) {
  file <- tempfile()
  html_node(x, css) %>% write_xml(file)
  txt <- readLines(file, warn = FALSE)
  unlink(file)
  txt
}
html_innerHTML(doc, ".pp")
[1] "<p class=\"pp\">First Line<br>Second Line</p>"
With this, I can then, for example, convert line break tags to newline characters:
html_innerHTML(doc, ".pp") %>%
  gsub("<br\\s*/?\\s*>", "\n", .) %>%
  read_html %>%
  html_text
[1] "First Line\nSecond Line"
Is there a better way to do this using existing functions, for example from rvest, xml2, XML, or other packages? In particular, I would like to avoid writing to the hard drive.
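One direction that may be worth trying, sketched here under the assumption that the installed xml2 provides an as.character() method for xml_node objects (recent releases do; I have not checked 0.2.0.900-era versions): serialize the node to a string in memory instead of going through a temporary file. The helper name node_to_string is hypothetical and only for illustration.

```r
library(rvest)  # provides read_html(), html_node(), html_text() and the %>% pipe
library(xml2)   # assumed to supply as.character() for xml_node objects

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# Hypothetical helper: return the markup of the first node matching `css`
# as a character string, without writing anything to disk.
node_to_string <- function(x, css) {
  as.character(html_node(x, css))
}

markup <- node_to_string(doc, ".pp")

# Replace <br> tags (in either <br> or <br/> form) with newlines,
# then re-parse the string and extract the text.
text <- markup %>%
  gsub("<br\\s*/?\\s*>", "\n", .) %>%
  read_html() %>%
  html_text()

print(markup)
print(text)
```

Note that this returns the node including its own tag, like the temp-file workaround above, so strictly speaking it is the outer HTML. If only the line-break handling matters, rvest 1.0.0 and later also offer html_text2(), which converts <br> to newlines directly.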