R - xpathApply on XMLNodeSet (with XML package)

I am trying to use the xpathApply function from the XML package in R to extract certain data from an HTML file. However, after I call xpathApply on some parent nodes of the HTML document, the resulting object has class XMLNodeSet, and I can no longer call xpathApply on that object, because this error appears: Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "XMLNodeSet"

Here is an R script that reproduces my problem (this example is just a simple table, and I know I could use the readHTMLTable function, but I need a lower-level function because my actual HTML is more complicated than this simple table):

    library(XML)
    y <- htmlParse(htmlfile)
    x <- xpathApply(y, "//table/tr")
    z <- xpathApply(x, "/td")

Here is the "htmlfile":

    <table>
      <tr>
        <td> Test1.1 </td>
        <td> Test1.2 </td>
      </tr>
      <tr>
        <td> Test1.3 </td>
        <td> Test1.4 </td>
      </tr>
    </table>

Is there any way to keep working on the nodes after using xpathApply? Or are there other good alternatives for extracting the data in the nodes?

+4
3 answers

Although working out the right XPath expression is the better solution, you can do this:

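    library(XML)
    y <- htmlParse(htmlfile)
    x <- getNodeSet(y, "//table/tr")
    z <- lapply(x, function(node) {
      # Copy the tr node into its own document, so "/" refers to that tr
      subDoc <- xmlDoc(node)
      # Query the sub-document (not the node) and extract the text
      # before freeing, so no node references outlive subDoc
      r <- xpathSApply(subDoc, "/tr/td", xmlValue)
      free(subDoc)  # not sure if necessary
      return(r)
    })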
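
As a possible alternative to building a sub-document, here is a minimal sketch that queries each tr node directly with a relative XPath. It assumes that xpathSApply accepts a single internal node as its first argument (the XML package provides a node method for xpathApply), and the inline htmlfile string is a hypothetical stand-in for the question's file.

```r
library(XML)

# Hypothetical inline copy of the question's HTML, so the sketch is self-contained
htmlfile <- "<table>
  <tr><td> Test1.1 </td><td> Test1.2 </td></tr>
  <tr><td> Test1.3 </td><td> Test1.4 </td></tr>
</table>"

y <- htmlParse(htmlfile, asText = TRUE)
rows <- getNodeSet(y, "//table/tr")

# "./td" is relative to the context node: it selects only the td
# children of the tr currently being processed
cells <- lapply(rows, function(node) xpathSApply(node, "./td", xmlValue))
```

Each element of cells is then a character vector holding the text of one row's td elements; the surrounding whitespace can be removed with trimws if needed.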
+2

Once you have a node list, you can apply a function such as xmlValue or xmlGetAttr to each node to extract its data. For example:

    x <- xpathApply(y, "//table/tr")
    sapply(x, xmlValue)  ## x is a list of nodes
    [1] " Test1.1 Test1.2 " " Test1.3 Test1.4 "

This is equivalent to:

    xpathSApply(y, "//table/tr", xmlValue)
    [1] " Test1.1 Test1.2 " " Test1.3 Test1.4 "

EDIT

I am sure your problem can be solved with the right XPath expression. You should learn to work with XML files the way you work with a database: XPath is like an SQL query. It is fast, and many browsers can help you build the correct XPath.

For instance:

    xpathSApply(y, "//table/tr[2]/td[1]", xmlValue)  # second tr and first td
    [1] " Test1.3 "
    xpathSApply(y, "//table/tr[2]/td[3]", xmlValue)  # second tr and third td (no such cell in this example table)

EDIT

It looks like the OP wants to replicate the XML structure (get the tr and td values in the same order).

Here is one way, though I don't think it is the most efficient:

    nn.trs <- length(xpathSApply(y, "//table/tr", I))
    lapply(seq(nn.trs), function(i) {
      xpathSApply(y, paste("//table/tr[", i, "]/td", sep = ''), xmlValue)
    })
    [[1]]
    [1] " Test1.1 " " Test1.2 "

    [[2]]
    [1] " Test1.3 " " Test1.4 "

If the number of td elements is the same for each tr, you can replace lapply with sapply, and you get a matrix with one column per tr:

         [,1]        [,2]
    [1,] " Test1.1 " " Test1.3 "
    [2,] " Test1.2 " " Test1.4 "

But I think readHTMLTable is better in this case.
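
For comparison, a minimal sketch of the readHTMLTable route on the question's simple table. The inline htmlfile string is a hypothetical stand-in for the real file, and header = FALSE is an assumption made here because the example table has no header row.

```r
library(XML)

# Hypothetical inline copy of the question's HTML, so the sketch is self-contained
htmlfile <- "<table>
  <tr><td> Test1.1 </td><td> Test1.2 </td></tr>
  <tr><td> Test1.3 </td><td> Test1.4 </td></tr>
</table>"

y <- htmlParse(htmlfile, asText = TRUE)

# readHTMLTable returns one data frame per <table> found in the document
tables <- readHTMLTable(y, header = FALSE, stringsAsFactors = FALSE)
first <- tables[[1]]
```

Here first should be a data frame with one row per tr and one column per td, which keeps the row and cell order without any per-row XPath queries.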

+1

This seems to work. Essentially, you need to query the individual elements of the list returned by xpathApply:

    > y <- htmlParse(htmlfile)
    > x <- xpathApply(y, "//table/tr")
    > x
    [[1]]
    <tr><td> Test1.1 </td> <td> Test1.2 </td> </tr>

    [[2]]
    <tr><td> Test1.3 </td> <td> Test1.4 </td> </tr>

    attr(,"class")
    [1] "XMLNodeSet"
    > z <- xpathApply(x[[1]], "//td")
    > z
    [[1]]
    <td> Test1.1 </td>

    [[2]]
    <td> Test1.2 </td>

    [[3]]
    <td> Test1.3 </td>

    [[4]]
    <td> Test1.4 </td>

    attr(,"class")
    [1] "XMLNodeSet"

PS: I'm not sure why it returns the td elements of the whole table and not just those under x[[1]]. It looks like a bug.
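
A likely explanation, offered as a sketch rather than a confirmed diagnosis: in XPath, a path that starts with // is evaluated from the document root even when a context node is supplied, whereas a path that starts with .// is evaluated relative to the context node. The snippet below contrasts the two on the question's table; the inline htmlfile string is a hypothetical stand-in for the real file.

```r
library(XML)

# Hypothetical inline copy of the question's HTML, so the sketch is self-contained
htmlfile <- "<table>
  <tr><td> Test1.1 </td><td> Test1.2 </td></tr>
  <tr><td> Test1.3 </td><td> Test1.4 </td></tr>
</table>"

y <- htmlParse(htmlfile, asText = TRUE)
rows <- xpathApply(y, "//table/tr")

# Absolute path: starts at the document root, so it is not limited to rows[[1]]
z_all <- xpathApply(rows[[1]], "//td")

# Relative path: the leading "." restricts the search to descendants of rows[[1]]
z_row <- xpathApply(rows[[1]], ".//td")
```

With the relative form, z_row should contain only the td elements of the first row.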

+1