Random sampling from an XML file to a data frame in R

How to get a sample of a given size from a large XML file in R?

Unlike sampling random lines from a plain-text file, here you need to respect the XML structure so that each sampled record can be read into the correct columns of a data.frame.

A possible solution is to read the entire file and then select rows from it, but is it possible to read only the necessary fragments?

Sample from file:

 <?xml version="1.0" encoding="UTF-8"?>
 <products>
   <product>
     <sku>967190</sku>
     <productId>98611</productId>
     ...
     <listingId/>
     <sellerId/>
     <shippingRestrictions/>
   </product>
   ...

The number of lines for each "product" varies, and the total number of entries is unknown until the file is opened.

xml random r large-files dataframe
2 answers

Instead of reading the entire file, you can use event-style parsing with a closure that handles the nodes of interest to you. To get there, I'll start with a reservoir-sampling strategy for choosing which records to keep. The stream is processed one entry at a time. If the i-th entry arrives while fewer than or exactly n entries have been seen, keep it unconditionally; otherwise keep it with probability n / i, replacing a randomly chosen entry already in the sample. This can be implemented as

 i <- 0L; n <- 10L
 select <- function() {
     i <<- i + 1L
     if (i <= n) {
         i
     } else {
         if (runif(1) < n / i) sample(n, 1) else 0
     }
 }

which behaves as follows:

 > i <- 0L; n <- 10L; replicate(20, select())
  [1]  1  2  3  4  5  6  7  8  9 10  1  5  7  0  1  9  0  2  1  0

This says to keep the first 10 elements, then replace element 1 with element 11, element 5 with element 12, and element 7 with element 13, then discard the 14th element, and so on. Replacements become less frequent as i grows much larger than n.
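A quick empirical check (not from the original answer) that this scheme really gives every element the same chance of ending up in the sample: the helper `reservoir_sample` below is an illustrative name, and it applies the same keep-with-probability-n/i rule to a whole vector at once.

```r
# Sketch: reservoir sampling over a vector, using the same rule as select().
reservoir_sample <- function(stream, n) {
    keep <- stream[seq_len(n)]              # first n elements are always kept
    for (i in seq(n + 1L, length(stream))) {
        if (runif(1) < n / i)               # keep the i-th element with prob n/i
            keep[sample(n, 1)] <- stream[i] # ...replacing a random current entry
    }
    keep
}

set.seed(1)
# Sample 10 of 100 items, 5000 times; each item should be kept ~10% of the time.
counts <- table(replicate(5000, reservoir_sample(1:100, 10)))
range(counts / 5000)
```

All 100 empirical inclusion frequencies come out close to 0.1, which is what uniform sampling requires.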

We use this as part of a product handler that pre-allocates space for the results we are interested in. Each time a "product" node is encountered, we check whether it should be selected and, if so, store its value in the appropriate slot of our current results:

 sku <- character(n)
 product <- function(p) {
     i <- select()
     if (i)
         sku[[i]] <<- xmlValue(p[["sku"]])
     NULL
 }

The select and product handlers are combined with a get function that lets us retrieve the current values, and all of them are wrapped in a closure, giving a factory-style template that encapsulates the variables n, i, and sku:

 sampler <- function(n) {
     force(n)  # otherwise lazy evaluation could lead to surprises
     i <- 0L
     select <- function() {
         i <<- i + 1L
         if (i <= n) {
             i
         } else {
             if (runif(1) < n / i) sample(n, 1) else 0
         }
     }
     sku <- character(n)
     product <- function(p) {
         i <- select()
         if (i)
             sku[[i]] <<- xmlValue(p[["sku"]])
         NULL
     }
     list(product=product, get=function() list(sku=sku))
 }

And then we are ready to go

 library(XML)
 products <- xmlTreeParse("foo.xml", handlers=sampler(1000))
 as.data.frame(products$get())

Once the number of processed nodes i becomes large relative to n, replacements are rare, so memory use stays fixed at n entries while running time scales linearly with file size. That means you can gauge performance reliably by starting with subsets of the original file.
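The answer's sampler captures only the sku field; the same pattern extends to any number of fields by pre-allocating one vector per field. The sketch below is not from the original post: `sampler2` is an illustrative name, and it assumes the XML package with an inline document in place of "foo.xml".

```r
# Sketch: a sampler that keeps both <sku> and <productId> per sampled product.
library(XML)

sampler2 <- function(n) {
    force(n)
    i <- 0L
    select <- function() {
        i <<- i + 1L
        if (i <= n) i
        else if (runif(1) < n / i) sample(n, 1)
        else 0
    }
    sku <- character(n)
    productId <- character(n)
    product <- function(p) {
        j <- select()
        if (j) {
            # store both fields in the same reservoir slot
            sku[[j]]       <<- xmlValue(p[["sku"]])
            productId[[j]] <<- xmlValue(p[["productId"]])
        }
        NULL
    }
    list(product = product,
         get = function() list(sku = sku, productId = productId))
}

xml <- '<products>
  <product><sku>1</sku><productId>a</productId></product>
  <product><sku>2</sku><productId>b</productId></product>
  <product><sku>3</sku><productId>c</productId></product>
</products>'

h <- sampler2(2)
invisible(xmlTreeParse(xml, asText = TRUE, handlers = h))
df <- as.data.frame(h$get(), stringsAsFactors = FALSE)
```

Because both vectors are indexed by the same reservoir slot, each row of the resulting data frame keeps its fields paired correctly even after replacements.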


Here is an example based on the XML file that you provided.

 library(XML)

 xml <- '<?xml version="1.0" encoding="UTF-8"?>
 <products>
   <product>
     <sku>967190</sku>
     <productId>98611</productId>
     <listingId/>
     <sellerId/>
     <shippingRestrictions/>
   </product>
   <product>
     <sku>967191</sku>
     <productId>98612</productId>
     <listingId/>
     <sellerId/>
     <shippingRestrictions/>
   </product>
   <product>
     <sku>967192</sku>
     <productId>98613</productId>
     <listingId/>
     <sellerId/>
     <shippingRestrictions/>
   </product>
 </products>'

 # parse
 p <- xmlParse(xml)
 # get nodes
 nodes <- xpathApply(p, '//product')
 # return a random sample of nodes
 nodes[sample(seq_along(nodes), 2)]

Here is the result:

 > nodes[sample(seq_along(nodes), 2)]
 [[1]]
 <product>
   <sku>967191</sku>
   <productId>98612</productId>
   <listingId/>
   <sellerId/>
   <shippingRestrictions/>
 </product>

 [[2]]
 <product>
   <sku>967190</sku>
   <productId>98611</productId>
   <listingId/>
   <sellerId/>
   <shippingRestrictions/>
 </product>
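To go from the sampled nodes to a data frame, the XML package's xmlToDataFrame() accepts a list of nodes directly via its `nodes` argument. A minimal sketch (not part of the original answer), using a shortened version of the same document:

```r
# Sketch: turn a random sample of <product> nodes into a data.frame.
library(XML)

xml <- '<products>
  <product><sku>967190</sku><productId>98611</productId></product>
  <product><sku>967191</sku><productId>98612</productId></product>
  <product><sku>967192</sku><productId>98613</productId></product>
</products>'

p <- xmlParse(xml)
nodes <- xpathApply(p, '//product')
sampled <- nodes[sample(seq_along(nodes), 2)]

# Each child element (<sku>, <productId>) becomes a column.
df <- xmlToDataFrame(nodes = sampled, stringsAsFactors = FALSE)
```

Note that this approach still parses the whole file into memory first, so it suits moderately sized files; the reservoir-sampling handler in the other answer is the better fit for truly large ones.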
