I want to read HTML files from a website. In particular, I want to read books in HTML format from gutenberg.org. The title of each chapter is marked with an "h2" tag, and the contents of each chapter are in the "p" (paragraph) tags that follow the "h2". Using the XML package, I can get either the values or the full HTML for each tag.
Here is some sample code using George Eliot's Middlemarch:
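    library(XML)
    # Parse the page into an internal document so XPath queries work
    doc.html <- htmlTreeParse('http://www.gutenberg.org/files/145/145-h/145-h.htm',
                              useInternalNodes = TRUE)
    # Text content of every h2 and p node
    doc.value <- xpathApply(doc.html, '//h2|//p', xmlValue)
    # The nodes themselves, including attributes and markup
    doc.html.value <- xpathApply(doc.html, '//h2|//p')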
doc.value contains a list in which each element is the text content of one tag, but I cannot tell whether it came from an "h2" tag or a "p" tag. On the other hand, doc.html.value contains a list with the HTML node for each tag. This tells me whether it is an "h2" or a "p" tag, but it also carries a lot of additional markup (style information and so on) that I don't need.
My question is: is there an easy way to get only the tag name and the tag value, without the other information attached to the node?
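To make the desired output concrete, here is a minimal sketch on a small inline HTML string. The string is a made-up stand-in for the Gutenberg page, and the names sample_html, nodes and result are only for illustration. It pairs xmlName with xmlValue from the XML package, which is one possible direction rather than a confirmed best practice:

```r
library(XML)

# Hypothetical sample standing in for the real Gutenberg page
sample_html <- '<html><body>
  <h2 style="text-align:center">CHAPTER I</h2>
  <p class="noindent">First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>'

# Parse from a string instead of a URL
doc <- htmlParse(sample_html, asText = TRUE)

# Select h2 and p nodes; the union is returned in document order
nodes <- getNodeSet(doc, "//h2|//p")

# One row per node: the tag name and its text, no attributes
result <- data.frame(
  tag   = sapply(nodes, xmlName),
  value = sapply(nodes, xmlValue),
  stringsAsFactors = FALSE
)
print(result)
```

What I am after is something like result: one entry per tag, in document order, with the tag name next to its text and none of the style attributes.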
html xml r html-parsing
user2840286