Saving specific XML node values with R xmlEventParse

I have a large XML file that I need to parse with xmlEventParse in R. Unfortunately, the examples online are more complicated than I need. I just want to flag a matching node tag and then store that node's text (not an attribute), with the text for each node type going into a separate list. See the comments in the code below:

 library(XML)

 z <- xmlEventParse(
   "my.xml",
   handlers = list(
     startDocument = function() {
       cat("Starting document\n")
     },
     startElement = function(name, attr) {
       if (name == "myNodeToMatch1") {
         cat("FLAG Matched element 1\n")
       }
       if (name == "myNodeToMatch2") {
         cat("FLAG Matched element 2\n")
       }
     },
     text = function(text) {
       # if ( Matched element 1 .... ) Store text in element 1 list
       # if ( Matched element 2 .... ) Store text in element 2 list
     },
     endDocument = function() {
       cat("ending document\n")
     }
   ),
   addContext = FALSE,
   useTagName = FALSE,
   ignoreBlanks = TRUE,
   trim = TRUE)

 z$ ... # show lists ??

My question is how to implement this flag in R (properly :)? Also, what is the best way to match N arbitrary nodes without writing a separate if (name == "myNodeToMatchN") case for each one?

my.xml might be simple XML such as:

 <A>
   <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>
   <B>
     <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>
     ...
   </B>
 </A>
r xml-parsing sax
3 answers

I'll use fileName from example(xmlEventParse) as a reproducible example. It contains record tags, each with an id attribute and text that we would like to extract. Instead of handlers, I'll use the branches argument. A branch function is like a handler, but it receives the full node rather than just the element name and attributes. The idea is to write a closure that holds a place to store the data we accumulate, plus functions to process each branch of the XML document we are interested in. So let's start by defining the closure, which for our purposes is a function that returns a list of functions:

 ourBranches <- function() { 

We need a place to store the results we accumulate. We choose an environment so that insertion time is constant (rather than a list, which we would have to append to, and which would be inefficient):

  store <- new.env() 

The event parser expects a list of functions, each of which is called when the corresponding tag is encountered. We are interested in the record tag. The function we write receives a node of the XML document. We extract the id attribute and use it as the key under which we store the node's (text) value in our store:

  record <- function(x, ...) {
    key <- xmlAttrs(x)[["id"]]
    value <- xmlValue(x)
    store[[key]] <- value
  }

Once the document is processed, we need a convenient way to get at our results, so we add a function for our own use that is independent of the nodes in the document:

  getStore <- function() as.list(store) 

and then finish the closure by returning a list of functions:

  list(record=record, getStore=getStore)
 }

The tricky concept here is that the environment in which a function is defined is part of the function, so each time we call ourBranches() we get a list of functions and a new store environment to hold our results. To use it, call xmlEventParse on our file with an empty set of event handlers, then access the accumulated store:

 > branches <- ourBranches()
 > xmlEventParse(fileName, list(), branches=branches)
 list()
 > head(branches$getStore(), 2)
 $`Hornet Sportabout`
 [1] "18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 "

 $`Toyota Corolla`
 [1] "33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 "
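To tie this back to the question about N arbitrary nodes: the same closure idea can generate one branch function per node name, so no per-node if is needed. This is only a sketch. makeBranches is a hypothetical helper (it is not part of the XML package), the inline XML string mirrors the question's example, and for brevity each vector is grown with c(), which would be slow on a very large file.

```r
library(XML)

# Hypothetical helper (not from the XML package): builds one branch
# function per node name; all of them share a single store environment.
makeBranches <- function(nodeNames) {
  store <- new.env()
  for (nm in nodeNames) store[[nm]] <- character()

  # lapply gives every generated function its own copy of `nm`
  branches <- lapply(nodeNames, function(nm) {
    force(nm)
    function(x, ...) {
      # append the text of the matched node to that node's vector
      store[[nm]] <- c(store[[nm]], xmlValue(x))
    }
  })
  names(branches) <- nodeNames

  list(branches = branches,
       getStore = function() mget(nodeNames, envir = store))
}

# Assumed sample input, modelled on the XML shown in the question
xml <- paste0(
  "<A><myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>",
  "<B><myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2></B></A>")

m <- makeBranches(c("myNodeToMatch1", "myNodeToMatch2"))
invisible(xmlEventParse(xml, handlers = list(),
                        branches = m$branches, asText = TRUE))
result <- m$getStore()
```

After the parse, result should be a named list with one character vector per requested node name. For a file on disk, drop asText = TRUE and pass the file name instead of the string.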

For others who want to learn from M. Morgan's answer, here is the complete code:

 library(XML)

 fileName <- system.file("exampleData", "mtcars.xml", package = "XML")

 ourBranches <- function() {
   # environment used as a store with constant-time insertion
   store <- new.env()
   # called for each <record> node; keyed by the node's id attribute
   record <- function(x, ...) {
     key <- xmlAttrs(x)[["id"]]
     value <- xmlValue(x)
     store[[key]] <- value
   }
   getStore <- function() as.list(store)
   list(record=record, getStore=getStore)
 }

 branches <- ourBranches()
 xmlEventParse(fileName, list(), branches=branches)
 head(branches$getStore(), 2)

The branches approach above does not preserve the order of events. In other words, the order of the records in branches$getStore() differs from their order in the XML file. Handler methods, on the other hand, can preserve order. Here is the code:

 library(XML)

 fileName <- system.file("exampleData", "mtcars.xml", package="XML")

 records  <- new('list')
 variable <- new('character')
 tag.open <- new('character')   # stack of currently open tags, innermost first
 nvar <- 0

 xmlEventParse(fileName, list(
   startElement = function (name, attrs) {
     tagName <<- name
     tag.open <<- c(name, tag.open)
     if (length(attrs)) {
       attributes(tagName) <<- as.list(attrs)
     }
   },
   text = function (x) {
     if (nchar(x) > 0) {
       if (tagName == "record") {
         record <- list()
         record[[attributes(tagName)$id]] <- x
         records <<- c(records, record)
       } else {
         if (tagName == 'variable') {
           v <- x
           variable <<- c(variable, v)
           nvar <<- nvar + 1
         }
       }
     }
   },
   endElement = function (name) {
     if (name == 'record') {
       print(paste(tag.open, collapse='>'))
     }
     tag.open <<- tag.open[-1]
   }))

 head(records, 2)
 $`Mazda RX4`
 [1] "21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4"

 $`Mazda RX4 Wag`
 [1] "21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4"

 variable
 [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"

Another advantage of using handlers is that you can capture the hierarchical structure. In other words, you can keep track of ancestors. One of the key points of this approach is the use of global variables, which are assigned with "<<-" instead of "<-".
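The "flag" asked about in the question can also be kept inside a closure rather than in global variables, still using "<<-" to update it. What follows is a minimal sketch under a few assumptions: makeHandlers is a hypothetical helper (not part of the XML package), the inline XML mirrors the question's example, and matched nodes are assumed not to be nested inside one another. Text is buffered until the closing tag in case the parser delivers a node's text in several chunks.

```r
library(XML)

# Hypothetical helper (not from the XML package): SAX handlers that
# share their state through the enclosing function's environment.
makeHandlers <- function(nodeNames) {
  current <- NULL           # the flag: name of the matched open element
  buffer  <- character()    # text chunks seen since that element opened
  found   <- setNames(vector("list", length(nodeNames)), nodeNames)

  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name %in% nodeNames) {
        current <<- name
        buffer  <<- character()
      }
    },
    text = function(content, ...) {
      # only keep text while a matched element is open
      if (!is.null(current)) buffer <<- c(buffer, content)
    },
    endElement = function(name, ...) {
      if (!is.null(current) && name == current) {
        found[[name]] <<- c(found[[name]], paste(buffer, collapse = ""))
        current <<- NULL
      }
    })

  list(handlers = handlers, getFound = function() found)
}

# Assumed sample input, modelled on the XML shown in the question
xml <- paste0(
  "<A><myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>",
  "<B><myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2></B></A>")

h <- makeHandlers(c("myNodeToMatch1", "myNodeToMatch2"))
invisible(xmlEventParse(xml, handlers = h$handlers, asText = TRUE,
                        addContext = FALSE, useTagName = FALSE,
                        ignoreBlanks = TRUE, trim = TRUE))
result <- h$getFound()
```

Because values are appended as each closing tag is seen, each vector in result should follow document order. Tracking ancestors as in the code above would only need an extra stack variable inside the same closure.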

