Removing a tag in readHTMLTable in the XML package

Question


I am trying to extract data from a table at the following URL:

http://www.nfpa.org/itemDetail.asp?categoryID=953&itemID=23033

The problem is the superscripts contained in

 <sup> </sup>

tags. When I use the following code (admittedly not very elegant)

 require(XML)
 url.overview <- "http://www.nfpa.org/itemDetail.asp?categoryID=953&itemID=23033"
 # read every table on the page, keep the second one, drop its first row
 overview <- readHTMLTable(url.overview)
 overview <- overview[[2]]
 overview <- overview[-1,]
 # strip non-ASCII characters, dollar signs and commas, then convert to numeric
 f <- function(x){
   out <- iconv(x, "latin1", "ASCII", sub="")
   out <- gsub('[\\$,]', '', out)
   out <- as.numeric(out)
   return(out)
 }
 overview <- matrix(f(as.character(unlist(overview))), ncol = ncol(overview))
 overview <- as.data.frame(overview)
 names(overview) <- c('year', 'fires', 'civ.deaths', 'civ.injuries', 'ff.deaths',
                      'ff.injuries', 'damage.reported', 'damage.2010dollars')

I get exactly what I want, except that the superscript values are appended to the end of the values in the table cells. For example (using the row and column names from the above URL), civilian deaths in 2001 are stored as 61963 when they should be 6196, because the superscript 3 is interpreted as an extra digit. Cells in the table that do not have a superscript look fine.

After many hours of reading the documentation, I was able to use htmlParse and getNodeSet from the XML package to identify all the nodes containing the <sup> tags, but I could not figure out what to do from there:

 overview <- htmlParse(url.overview)
 getNodeSet(overview, "//sup")

I believe I need to delete these nodes from the XML tree and then pass the result back to readHTMLTable for further processing, but I could not figure out how to do this.

I will be very grateful for your thoughts.

+6

r

inhuretnakht Aug 21 '12 at 10:56


1 answer

shhhhimhuntingrabbits · Accepted Answer · 2012-08-22T00:37:11+0000

Try

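 require(XML)
 url.overview <- "http://www.nfpa.org/itemDetail.asp?categoryID=953&itemID=23033"
 overview <- htmlParse(url.overview, encoding="UTF-8")
 # select the <sup> nodes inside the small-print spans and drop them from the tree
 temp <- getNodeSet(overview, "/*//span[@class=\"small\"]/sup")
 removeNodes(temp)
 # read the tables from the cleaned document and keep the second one
 app.data <- readHTMLTable(overview)[[2]]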

Here we just delete the nodes we don't want and pass the remaining document back to readHTMLTable, taking the second table. I had problems with the encoding on my machine. You can leave out the encoding argument in htmlParse, and it may work fine without it.
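
The same idea can be tried without touching the live page. The sketch below is only an illustration: the inline HTML string, the column names, and the helper to_number are made up for the example and are not taken from the NFPA page. It parses a tiny table, removes every <sup> node with removeNodes, and reads the table before and after so the effect on the cell text can be compared.

```r
library(XML)

# Hypothetical stand-in for the real page: one cell carries a footnote marker.
html <- '<html><body><table>
<tr><th>year</th><th>civ.deaths</th></tr>
<tr><td>2001</td><td>6,196<sup>3</sup></td></tr>
<tr><td>2002</td><td>3,380</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Read the table while the <sup> nodes are still in the tree.
before <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)[[1]]

# Remove every <sup> node from the parsed document, then read the table again.
removeNodes(getNodeSet(doc, "//sup"))
after <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)[[1]]

# Illustrative helper (not part of the XML package): strip "$" and "," and convert.
to_number <- function(x) as.numeric(gsub("[$,]", "", x))

print(to_number(before[[2]]))  # footnote digit still glued onto the 2001 value
print(to_number(after[[2]]))   # footnote digit gone
```

The XPath "//sup" removes all superscripts in the document; a narrower expression such as the one in the answer above is the safer choice when some superscripts carry real data.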
