It may be that the "internal" representation is not all that big
xml = xmlTreeParse("file.xml", useInternalNodes=TRUE)
and then xpath will definitely be your best bet. If that does not work, you need to get your head around closures. I am going to aim for the branches argument of xmlEventParse, which allows a hybrid approach: event parsing to iterate through the file, combined with DOM parsing on each node of interest. Here is a function that returns a list of functions.
branchFactory <- function() {
Inside this function, we create a place to store our results as we work through the file. This could be a list, but it is better to use an environment, which lets us insert new results without copying all the results we have already inserted. So, our environment:
env <- new.env(parent=emptyenv())
We use the parent argument as a safety measure, even though it is not relevant in the present case. Now we define a function that will be called whenever a "FrameSet" node is encountered
FrameSet <- function(elt) {
    id <- paste(xmlAttrs(elt), collapse=":")
    env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
}
It turns out that when we use the branches argument, xmlEventParse arranges to parse the entire node into an object that we can manipulate through the DOM, for example using xmlAttrs and xpathSApply. The first line of this function creates a unique identifier for this frame set (perhaps it is not unique across the full data set? You will need a unique identifier). We then parse the "//Frame" part of the element and store it in our environment.
Storing the result is trickier than it looks, because we are assigning to a variable called env. env does not exist in the body of the FrameSet function, so R uses its lexical scoping rules to look for a variable called env in the environment in which the FrameSet function was defined. And so it finds the env that we have already created. That is where we add the result of xpathSApply. That is it for our FrameSet node parser.
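The scoping behaviour described above can be tried without any XML at all. The following is only a minimal sketch in base R; the names storeFactory, record, s1 and s2 are made up for illustration and are not part of the XML package.

```r
# Hypothetical illustration of closures sharing an environment (base R only).
storeFactory <- function() {
    env <- new.env(parent=emptyenv())
    record <- function(key, value) {
        # env is looked up in the enclosing storeFactory() call (lexical scoping);
        # environments are modified in place, so nothing is copied here
        env[[key]] <- value
        invisible(NULL)
    }
    get <- function() env
    list(record=record, get=get)
}

s1 <- storeFactory()
s2 <- storeFactory()   # a second, independent store
s1$record("x", 1)
s1$record("y", 2)
ls(s1$get())           # names stored through s1
ls(s2$get())           # s2 has its own env, still empty
```

Each call to storeFactory() creates a fresh env, which is why two parsers built this way would not share results.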
We also need a convenience function that we can use to retrieve env, for example:
get <- function() env
Again, this uses lexical scoping to find the env variable created at the top of branchFactory. We end branchFactory by returning a list of the functions we have defined
list(get=get, FrameSet=FrameSet)
}
This is also surprisingly subtle: we return a list of functions. The functions are defined in the environment created by the call to branchFactory, and for lexical scoping to work that environment has to persist. So in fact we return not only a list of functions but also, implicitly, the variable env.
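Assembled from the pieces above, the complete factory might read as follows. This is just the fragments from this answer put in one place, assuming the XML package is installed; the variable name parser is only for illustration.

```r
library(XML)  # provides xmlAttrs and xpathSApply, used inside FrameSet

branchFactory <- function() {
    # storage for results; shared by the two functions defined below
    env <- new.env(parent=emptyenv())
    # called with a fully parsed node for each "FrameSet" element
    FrameSet <- function(elt) {
        id <- paste(xmlAttrs(elt), collapse=":")
        env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
    }
    # accessor for the accumulated results
    get <- function() env
    list(get=get, FrameSet=FrameSet)
}

parser <- branchFactory()
names(parser)   # the two functions returned by the factory
```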
Now we are ready to parse our file. Do this by creating an instance of the branch parser, with its own versions of the get and FrameSet functions and its own env variable to store the results. Then parse the file
b <- branchFactory()
xx <- xmlEventParse("file.xml", handlers=list(), branches=b)
We can retrieve the results using b$get(), and can coerce this to a list if convenient.
> as.list(b$get())
$`1sthalf:0000T0:REFEREE:00011D`
  [,1]                  [,2]                  [,3]
N "0"                   "1"                   "2"
T "2012-09-29T18:31:21" "2012-09-29T18:31:21" "2012-09-29T18:31:21"
X "-0.1158"             "-0.1146"             "-0.1134"
Y "0.2347"              "0.2351"              "0.2356"
S "1.27"                "1.3"                 "1.33"

$`2ndhalf:0000T0:REFEREE:00011D`
  [,1]                  [,2]                      [,3]
N "0"                   "1"                       "2"
T "2012-09-29T18:31:21" "2012-09-29T18:31:21.196" "2012-09-29T18:31:21.243"
X "-0.1158"             "-0.1146"                 "-0.1134"
Y "0.2347"              "0.2351"                  "0.2356"
S "1.27"                "1.3"                     "1.33"
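Each element of that list is a character matrix with one column per Frame. If a data frame per frame set is more convenient, something along these lines might do. It is a sketch only: the matrix m is a hand-built stand-in for one element of as.list(b$get()), with values copied from the output above, and toFrame is a made-up helper name.

```r
# Stand-in for one element of as.list(b$get()); values copied from the output above
m <- matrix(c("0", "2012-09-29T18:31:21", "-0.1158", "0.2347", "1.27",
              "1", "2012-09-29T18:31:21", "-0.1146", "0.2351", "1.3",
              "2", "2012-09-29T18:31:21", "-0.1134", "0.2356", "1.33"),
            nrow=5, dimnames=list(c("N", "T", "X", "Y", "S"), NULL))
res <- list(`1sthalf:0000T0:REFEREE:00011D`=m)

# Hypothetical helper: one row per Frame, numeric columns converted from character
toFrame <- function(m) {
    df <- as.data.frame(t(m), stringsAsFactors=FALSE)
    num <- c("N", "X", "Y", "S")   # assumes these attributes are numeric
    df[num] <- lapply(df[num], as.numeric)
    df
}

frames <- lapply(res, toFrame)
str(frames[[1]])
```

The T column is left as character here; converting it to a date-time class is a separate decision that depends on how the timestamps are meant to be used.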