I am trying to scrape data from a table at the following URL:
http:
The problem is the superscripts contained in <sup> </sup> tags. When I use the following code (admittedly not very elegant):
    library(XML)

    url.overview <- "http://www.nfpa.org/itemDetail.asp?categoryID=953&itemID=23033"
    overview <- readHTMLTable(url.overview)
    overview <- overview[[2]]    # keep the second table on the page
    overview <- overview[-1, ]   # drop the first row

    # Strip non-ASCII characters, dollar signs and commas, then convert to numeric
    f <- function(x) {
      out <- iconv(x, "latin1", "ASCII", sub = "")
      out <- gsub("[\\$,]", "", out)
      out <- as.numeric(out)
      return(out)
    }

    overview <- matrix(f(as.character(unlist(overview))), ncol = ncol(overview))
    overview <- as.data.frame(overview)
    names(overview) <- c("year", "fires", "civ.deaths", "civ.injuries", "ff.deaths",
                         "ff.injuries", "damage.reported", "damage.2010dollars")
I get exactly what I want, except that the superscripts are appended to the values in the table cells. For example (using the row and column names from the URL above), "Civilian deaths" in 2001 is stored as 61963 when it should be 6196, because the superscript 3 is interpreted as an extra digit. Cells in the table that do not have a superscript come out fine.
After many hours of poring over the documentation, I was able to use htmlParse and getNodeSet from the XML package to identify all the nodes containing the <sup> tags, but I could not figure out what to do from there:

    overview <- htmlParse(url.overview)
    getNodeSet(overview, "//sup")
I believe I need to remove these nodes from the XML tree and then pass the result back to readHTMLTable for further processing, but I could not figure out how to do this.
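To make the idea concrete, here is a minimal sketch of what I have in mind. It runs on a small inline HTML snippet (a made-up stand-in for the real page, not the actual NFPA table), and it assumes that removeNodes from the XML package is the right tool for dropping the <sup> nodes before the tree goes to readHTMLTable. I have not confirmed that this is the recommended approach.

```r
library(XML)  # htmlParse, getNodeSet, removeNodes, readHTMLTable

# Hypothetical stand-in for the real page: the first row carries a
# superscript footnote marker, the second row does not.
html <- '<table>
  <tr><td>2001</td><td>6,196<sup>3</sup></td></tr>
  <tr><td>2002</td><td>3,380</td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# Remove every <sup> node from the parsed tree (doc is modified in place).
removeNodes(getNodeSet(doc, "//sup"))

# Read the cleaned tree. header = FALSE because this toy table has no
# header row; the real page would need different handling.
tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)

# Second column, with commas and dollar signs stripped before conversion.
civ.deaths <- as.numeric(gsub("[\\$,]", "", as.character(tables[[1]][[2]])))
```

If this is sound, the same two steps (htmlParse, then removeNodes on the "//sup" node set) would go in front of my existing readHTMLTable call, with the parsed document passed in place of the URL.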
I would be very grateful for any suggestions.