HTML file analysis in R

I want to read HTML files from a website. In particular, I want to read books in HTML format from gutenberg.org. The title of each chapter is marked with an "h2" tag, and the contents of each chapter are in the paragraph ("p") tags that follow the "h2". Using the XML package, I can get either the text values or the full HTML for each tag.

Here is sample code using George Eliot's Middlemarch:

library(XML)

doc.html <- htmlTreeParse('http://www.gutenberg.org/files/145/145-h/145-h.htm',
                          useInternal = TRUE)
doc.value      <- xpathApply(doc.html, '//h2|//p', xmlValue)
doc.html.value <- xpathApply(doc.html, '//h2|//p')

doc.value contains a list in which each element is the text content of a tag, but I cannot tell whether a given element came from an h2 tag or a p tag. On the other hand, doc.html.value contains a list of the HTML nodes themselves. This tells me whether each node is an "h2" or a "p" tag, but it also carries a lot of extra markup (style information, etc.) that I don't need.

My question is: is there an easy way to get only the tag name and the tag's text value, without the other information attached to the node?

html xml r html-parsing
1 answer

Looking at the documentation for xmlValue, you'll notice there is a companion function called xmlName that retrieves just the tag name. Combining the two, you can compute exactly what you want:

doc.html.name.value <- xpathApply(doc.html, '//h2|//p', function(x) {
  list(name = xmlName(x), content = xmlValue(x))
})

> doc.html.name.value[[1]]
$name
[1] "h2"

$content
[1] "\r\nGeorge Eliot\r\n"
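Since the original goal was to read whole chapters, here is one possible follow-up sketch (not part of the original answer): it flattens the name/content pairs into a data frame and uses a running count of "h2" tags to assign each paragraph to the chapter heading that precedes it. The variable names (`df`, `chapters`) are illustrative, and it assumes `doc.html` has been parsed as above.

```r
library(XML)

doc.html <- htmlTreeParse('http://www.gutenberg.org/files/145/145-h/145-h.htm',
                          useInternal = TRUE)

# Tag name and text content for every h2 and p node, in document order
nodes <- xpathApply(doc.html, '//h2|//p',
                    function(x) list(name = xmlName(x), content = xmlValue(x)))

df <- data.frame(name    = sapply(nodes, `[[`, "name"),
                 content = sapply(nodes, `[[`, "content"),
                 stringsAsFactors = FALSE)

# Each h2 increments the chapter counter; paragraphs inherit the current value
df$chapter <- cumsum(df$name == "h2")

# List of character vectors: one element per chapter, holding its paragraphs
chapters <- split(df$content[df$name == "p"], df$chapter[df$name == "p"])
titles   <- df$content[df$name == "h2"]
```

Paragraphs appearing before the first h2 (front matter) end up in group 0 and can be dropped if unwanted.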
