
How to vectorize XML data?

Let's say I have this XML file:

    <?xml version="1.0" encoding="UTF-8" ?>
    <TimeSeries>
      <timeZone>1.0</timeZone>
      <series>
        <header/>
        <event date="2009-09-30" time="10:00:00" value="0.0" flag="2"></event>
        <event date="2009-09-30" time="10:15:00" value="0.0" flag="2"></event>
        <event date="2009-09-30" time="10:30:00" value="0.0" flag="2"></event>
        <event date="2009-09-30" time="10:45:00" value="0.0" flag="2"></event>
        <event date="2009-09-30" time="11:00:00" value="0.0" flag="2"></event>
        <event date="2009-09-30" time="11:15:00" value="0.0" flag="2"></event>
      </series>
      <series>
        <header/>
        <event date="2009-09-30" time="08:00:00" value="1.0" flag="2"></event>
        <event date="2009-09-30" time="08:15:00" value="2.6" flag="2"></event>
        <event date="2009-09-30" time="09:00:00" value="6.3" flag="2"></event>
        <event date="2009-09-30" time="09:15:00" value="4.4" flag="2"></event>
        <event date="2009-09-30" time="09:30:00" value="3.9" flag="2"></event>
        <event date="2009-09-30" time="09:45:00" value="2.0" flag="2"></event>
        <event date="2009-09-30" time="10:00:00" value="1.7" flag="2"></event>
        <event date="2009-09-30" time="10:15:00" value="2.3" flag="2"></event>
        <event date="2009-09-30" time="10:30:00" value="2.0" flag="2"></event>
      </series>
      <series>
        <header/>
        <event date="2009-09-30" time="10:00:00" value="0.0" flag="2"></event>
        <event date="2009-09-30" time="10:15:00" value="0.0" flag="2"></event>
        <event date="2009-09-30" time="10:30:00" value="0.0" flag="2"></event>
        <event date="2009-09-30" time="10:45:00" value="0.0" flag="2"></event>
        <event date="2009-09-30" time="11:00:00" value="0.0" flag="2"></event>
      </series>
    </TimeSeries>

Say that I want to do something with the elements in each series, and that I would like to put the “vectorize” advice into practice. I load the XML library and do the following:

    R> library("XML")
    R> doc <- xmlTreeParse('/home/mario/Desktop/sample.xml')
    R> TimeSeriesNode <- xmlRoot(doc)
    R> seriesNodes <- xmlElementsByTagName(TimeSeriesNode, "series")
    R> length(seriesNodes)
    [1] 3
    R> (function(x){length(xmlElementsByTagName(x[['series']], 'event'))}
    + )(seriesNodes)
    [1] 6
    R>

I don't understand why I get only the result of applying the function to the first element. I expected three values, one per element of seriesNodes, something like this:

    R> mapply(length, seriesNodes)
    series series series
         7     10      6

Oops, I have already found the answer myself: use mapply.

    R> mapply(function(x){length(xmlElementsByTagName(x, 'event'))}, seriesNodes)
    series series series
         6      9      5

But then I see the following problem: The R Inferno tells me that I am “hiding the loop” rather than “vectorizing”. Can I avoid the loop altogether?

+4
2 answers

You can also use xpathApply or xpathSApply. These functions select a set of nodes with an XPath expression and then apply a function to each node in that set. Both are provided by the XML package. To use them, the XML document must be parsed with xmlInternalTreeParse, or with xmlTreeParse with the useInternalNodes argument set to TRUE:

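    require( XML )

    # Count the <event> children of a single <series> node
    countEvents <- function( series ){
      events <- xmlElementsByTagName( series, 'event' )
      return( length( events ) )
    }

    # The XPath functions need internal nodes, hence useInternalNodes = TRUE
    doc <- xmlTreeParse( "sample.xml", useInternalNodes = TRUE )

    xpathSApply( doc, '/TimeSeries/series', countEvents )
    [1] 6 9 5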

I don't know whether this is any faster, but the code is definitely cleaner, and it is very clear to anyone who knows XPath syntax and how the apply functions work.

+3

Since seriesNodes is a list of nodes, there is no easy way to avoid an implicit loop. Simple operations such as getting lengths are not computationally intensive, so I would not lose sleep over being unable to vectorize them.

Note that you can use sapply(seriesNodes, length) instead of mapply, since length takes only one argument and there is only one list to iterate over.
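As a side note on why the question's first attempt returned a single value: the anonymous function was called once on the whole list, and `[['series']]` on a list with repeated names returns only the first match. The sketch below illustrates this in base R with a plain list standing in for seriesNodes. The names `nodes`, `count_first`, `count_each` and `count_each_strict` are made up for the illustration, and the integer vectors are placeholders, not real XML nodes.

```r
# Hypothetical stand-in for seriesNodes: three elements that all share
# the name "series" (placeholder integer vectors, not XML nodes).
nodes <- list(series = 1:6, series = 1:9, series = 1:5)

# Called once on the whole list: [["series"]] picks only the first
# element with that name, so a single length comes back.
count_first <- (function(x) length(x[["series"]]))(nodes)

# Called once per element: one length for each list entry.
count_each <- sapply(nodes, length)

# vapply does the same but checks that every result is one integer.
count_each_strict <- vapply(nodes, length, integer(1))

print(count_first)
print(count_each)
```

vapply is used here only as a stricter alternative to sapply; either one still loops over the list internally.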

The “proper R way” is to use (s|m)apply calls to extract vectors of the useful pieces of data, and then to analyze those vectors in the usual way.
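A minimal sketch of that workflow, assuming the XML package is installed and using a shortened inline copy of the sample file: the attributes are pulled out into plain vectors once, and everything afterwards is ordinary vectorized R. The variable names (`xml_text`, `value`, `stamp`) are illustrative, and treating the timestamps as UTC is an assumption, since the question does not say how the `timeZone` element should be interpreted.

```r
library(XML)

# Shortened inline copy of the sample document (assumption: same layout
# as the file in the question, with fewer events).
xml_text <- '<TimeSeries>
  <series>
    <event date="2009-09-30" time="08:00:00" value="1.0" flag="2"/>
    <event date="2009-09-30" time="08:15:00" value="2.6" flag="2"/>
  </series>
  <series>
    <event date="2009-09-30" time="10:00:00" value="0.0" flag="2"/>
  </series>
</TimeSeries>'

# xmlParse returns internal nodes, which the XPath functions require.
doc <- xmlParse(xml_text, asText = TRUE)

# One vector per attribute; the extra argument is passed on to xmlGetAttr.
value <- as.numeric(xpathSApply(doc, "//series/event", xmlGetAttr, "value"))
date  <- xpathSApply(doc, "//series/event", xmlGetAttr, "date")
time  <- xpathSApply(doc, "//series/event", xmlGetAttr, "time")

# Combine date and time into timestamps (UTC is an assumption).
stamp <- as.POSIXct(paste(date, time),
                    format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# From here on, ordinary vectorized operations apply.
print(mean(value))
print(value[stamp >= as.POSIXct("2009-09-30 08:15:00", tz = "UTC")])
```

The implicit loop is still there inside xpathSApply, but it runs once per attribute at extraction time; the analysis itself works on whole vectors.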

Finally, if you are really desperate to vectorize the event counting, use names(unlist(seriesNodes)), and then count the occurrences of "series.children.event.name" between successive occurrences of "series.name". This is undoubtedly uglier, and perhaps slower, than a sapply call.

+3
