R - xpathApply on XMLNodeSet (with XML package)

Question

I am trying to use the xpathApply function from the XML package in R to extract certain data from an HTML file. However, after I call xpathApply on some parent nodes of the HTML document, the result is an object of class XMLNodeSet, and I can no longer call xpathApply on it, because this error appears: Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "XMLNodeSet"

Here is an R script that reproduces my problem (this example is just a simple table; I know I could use the readHTMLTable function, but I need a lower-level function because my actual HTML is more complicated than this simple table):

    library(XML)
    y <- htmlParse(htmlfile)
    x <- xpathApply(y, "//table/tr")
    z <- xpathApply(x, "/td")

Here is the "htmlfile":

    <table>
      <tr> <td> Test1.1 </td> <td> Test1.2 </td> </tr>
      <tr> <td> Test1.3 </td> <td> Test1.4 </td> </tr>
    </table>

Is there any way to keep working on the nodes after using xpathApply? Or are there other good alternatives for extracting the data in the nodes?

html r web-scraping

Joyce Feb 19 '13 at 12:34

3 answers

c0bra · Answer 1 · 2015-02-18T14:12:26+0000

Although working out the right XPath expression looks like the better solution, you can also do this:

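    library(XML)
    y <- htmlParse(htmlfile)
    x <- getNodeSet(y, "//table/tr")
    z <- lapply(x, function(node) {
      subDoc <- xmlDoc(node)  # copy this <tr> into its own small document
      # the copied <tr> is the root of subDoc, so the path starts at /tr;
      # extract the values before the sub-document is freed
      r <- xpathApply(subDoc, "/tr/td", xmlValue)
      free(subDoc)  # not sure if necessary
      r
    })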

agstudy · Answer 2 · 2013-02-19T13:08:31+0000

Once you have a list of nodes, you can apply a function such as xmlValue or xmlGetAttr to each node to extract its content. For example:

    x <- xpathApply(y, "//table/tr")
    sapply(x, xmlValue)  ## x is a list of nodes
    " Test1.1 Test1.2 " " Test1.3 Test1.4 "

Which is equivalent to:

    xpathSApply(y, "//table/tr", xmlValue)
    " Test1.1 Test1.2 " " Test1.3 Test1.4 "
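
xmlGetAttr is used the same way as xmlValue. The sample table in the question has no attributes, so the sketch below uses hypothetical markup in which each cell carries an id attribute; the markup and the variable names are assumptions made only for illustration.

```r
library(XML)

# Hypothetical markup (not from the question): each <td> has an id attribute
html <- '<table>
  <tr><td id="a"> Test1.1 </td><td id="b"> Test1.2 </td></tr>
  <tr><td id="c"> Test1.3 </td><td id="d"> Test1.4 </td></tr>
</table>'
doc <- htmlParse(html, asText = TRUE)

# Collect every cell, then apply an extractor function to each node
cells <- getNodeSet(doc, "//table/tr/td")
ids   <- sapply(cells, xmlGetAttr, "id")   # attribute value of each cell
vals  <- trimws(sapply(cells, xmlValue))   # text content, whitespace trimmed
```

Here ids and vals are plain character vectors, one entry per cell, in document order.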

EDIT

I am sure your problem can be solved with the right XPath expression. XPath is worth learning if you work with XML files: it is to XML what SQL is to a database. It is fast, and many browsers can help you build the correct XPath.

For instance:

    xpathSApply(y, "//table/tr[2]/td[1]", xmlValue)  # second tr and first td
    [1] " Test1.3 "
    xpathSApply(y, "//table/tr[2]/td[3]", xmlValue)  # second tr and third td

EDIT

It looks like the OP wants to preserve the XML structure (get the td values grouped by tr, in the same order).

Here is one way to do it, although I don't think it is the most efficient:

    nn.trs <- length(xpathSApply(y, "//table/tr", I))
    lapply(seq(nn.trs), function(i) {
      xpathSApply(y, paste("//table/tr[", i, "]/td", sep = ''), xmlValue)
    })
    [[1]]
    [1] " Test1.1 " " Test1.2 "
    [[2]]
    [1] " Test1.3 " " Test1.4 "

If the number of td elements is the same for each tr, you can replace lapply with sapply, and you get:

         [,1]        [,2]
    [1,] " Test1.1 " " Test1.3 "
    [2,] " Test1.2 " " Test1.4 "

But I think readHTMLTable is better in this case.
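
For a plain table like the sample, a minimal readHTMLTable sketch could look like the following. It inlines the question's markup as a string (an assumption, since the question reads it from a file), and header = FALSE is chosen here because the sample has no header row.

```r
library(XML)

# The question's sample table, inlined as text instead of read from a file
html <- "<table> <tr> <td> Test1.1 </td> <td> Test1.2 </td> </tr> <tr> <td> Test1.3 </td> <td> Test1.4 </td> </tr> </table>"
doc <- htmlParse(html, asText = TRUE)

# readHTMLTable returns one data frame per <table> found in the document
tabs <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)
tab  <- tabs[[1]]   # the only table in this sample
```

This gives one row per tr and one column per td, which suits regular tables; the lower-level XPath functions remain the better fit for irregular markup such as the OP describes.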

Chinmay patil · Answer 3 · 2013-02-19T13:14:30+0000

It seems to work. Essentially, you need to call xpathApply on the individual elements of the list that xpathApply returned:

    > y <- htmlParse(htmlfile)
    > x <- xpathApply(y, "//table/tr")
    > x
    [[1]]
    <tr><td> Test1.1 </td> <td> Test1.2 </td> </tr>
    [[2]]
    <tr><td> Test1.3 </td> <td> Test1.4 </td> </tr>
    attr(,"class")
    [1] "XMLNodeSet"
    > z <- xpathApply(x[[1]], "//td")
    > z
    [[1]]
    <td> Test1.1 </td>
    [[2]]
    <td> Test1.2 </td>
    [[3]]
    <td> Test1.3 </td>
    [[4]]
    <td> Test1.4 </td>
    attr(,"class")
    [1] "XMLNodeSet"

PS: I'm not sure why it searches every element of the list x and not just x[[1]]. It looks like a bug.
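
A likely explanation for all four cells coming back from x[[1]] is standard XPath semantics rather than a bug: a path that starts with / or // is evaluated from the document root, even when the call is made on a single node, while a path that starts with ./ is relative to that node. The sketch below assumes the question's sample markup, inlined as a string, and shows the difference together with a per-row extraction.

```r
library(XML)

# The question's sample table, inlined as text instead of read from a file
html <- "<table> <tr> <td> Test1.1 </td> <td> Test1.2 </td> </tr> <tr> <td> Test1.3 </td> <td> Test1.4 </td> </tr> </table>"
y <- htmlParse(html, asText = TRUE)
x <- getNodeSet(y, "//table/tr")

# Absolute path: starts at the document root, so every <td> matches
all_cells <- xpathSApply(x[[1]], "//td", xmlValue)

# Relative path: only the <td> children of the first <tr>
row_cells <- xpathSApply(x[[1]], "./td", xmlValue)

# The node set is a list, so loop over it and query each row separately
rows <- lapply(x, xpathSApply, "./td", xmlValue)
```

rows is then a list with one character vector per tr, which keeps the row structure the OP asked for without rebuilding an indexed XPath string for each row.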
