HTML file analysis in R

I want to read HTML files from a website. In particular, I want to read books in HTML format from gutenberg.org. The title of each chapter is marked with an "h2" tag, and the contents of each chapter are in the paragraph ("p") tags that follow the "h2". Using the XML package, I can get either the text values or the full HTML for each tag.

Here is sample code using George Eliot's Middlemarch:

library(XML)

doc.html <- htmlTreeParse('http://www.gutenberg.org/files/145/145-h/145-h.htm',
                          useInternal = TRUE)
doc.value      <- xpathApply(doc.html, '//h2|//p', xmlValue)
doc.html.value <- xpathApply(doc.html, '//h2|//p')

doc.value contains a list in which each element is the text content of a tag, but I cannot tell whether a given element came from an h2 tag or a p tag. On the other hand, doc.html.value contains a list of the HTML nodes themselves. This tells me whether each node is an "h2" or a "p" tag, but it also carries a lot of extra markup (style information, etc.) that I don't need.

My question is: is there an easy way to get only the tag name and the tag's text value, without the other information attached to the node?

html xml r html-parsing
1 answer

Looking at the documentation for xmlValue, you'll notice there is a companion function called xmlName that retrieves just the tag name. Combining the two, you can compute exactly what you want:

doc.html.name.value <- xpathApply(doc.html, '//h2|//p', function(x) {
  list(name = xmlName(x), content = xmlValue(x))
})

> doc.html.name.value[[1]]
$name
[1] "h2"

$content
[1] "\r\nGeorge Eliot\r\n"
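Since the original goal was to read whole chapters, here is one possible follow-up sketch (not part of the original answer): it flattens the name/content pairs into a data frame and uses a running count of "h2" tags to assign each paragraph to the chapter heading that precedes it. The variable names (`df`, `chapters`) are illustrative, and it assumes `doc.html` has been parsed as above.

```r
library(XML)

doc.html <- htmlTreeParse('http://www.gutenberg.org/files/145/145-h/145-h.htm',
                          useInternal = TRUE)

# Tag name and text content for every h2 and p node, in document order
nodes <- xpathApply(doc.html, '//h2|//p',
                    function(x) list(name = xmlName(x), content = xmlValue(x)))

df <- data.frame(name    = sapply(nodes, `[[`, "name"),
                 content = sapply(nodes, `[[`, "content"),
                 stringsAsFactors = FALSE)

# Each h2 increments the chapter counter; paragraphs inherit the current value
df$chapter <- cumsum(df$name == "h2")

# List of character vectors: one element per chapter, holding its paragraphs
chapters <- split(df$content[df$name == "p"], df$chapter[df$name == "p"])
titles   <- df$content[df$name == "h2"]
```

Paragraphs appearing before the first h2 (front matter) end up in group 0 and can be dropped if unwanted.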
