Random sampling from an XML file to a data frame in R

How to get a sample of a given size from a large XML file in R?

Unlike sampling random lines from a plain-text file, here you need to respect the XML structure so that each sampled record can be read into the correct columns of a data.frame.

A possible solution is to read the entire file and then select rows from it, but is it possible to read only the necessary fragments?

Sample from file:

 <?xml version="1.0" encoding="UTF-8"?>
 <products>
   <product>
     <sku>967190</sku>
     <productId>98611</productId>
     ...
     <listingId/>
     <sellerId/>
     <shippingRestrictions/>
   </product>
   ...

The number of lines for each "product" varies, and the total number of entries is unknown until the file is opened.

xml random r large-files dataframe
2 answers

Instead of reading the entire file, you can use event-style parsing with a closure that handles the nodes of interest to you. To get there, I'll start with a reservoir-sampling strategy for choosing which records to keep. The stream is processed one entry at a time. If the i-th entry arrives while fewer than or exactly n entries have been seen, keep it unconditionally; otherwise keep it with probability n / i, replacing a randomly chosen entry already in the sample. This can be implemented as

 i <- 0L; n <- 10L
 select <- function() {
     i <<- i + 1L
     if (i <= n) {
         i
     } else {
         if (runif(1) < n / i) sample(n, 1) else 0
     }
 }

which behaves as follows:

 > i <- 0L; n <- 10L; replicate(20, select())
  [1]  1  2  3  4  5  6  7  8  9 10  1  5  7  0  1  9  0  2  1  0

This says to keep the first 10 elements, then replace element 1 with element 11, element 5 with element 12, and element 7 with element 13, then discard the 14th element, and so on. Replacements become less frequent as i grows much larger than n.
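A quick empirical check (not from the original answer) that this scheme really gives every element the same chance of ending up in the sample: the helper `reservoir_sample` below is an illustrative name, and it applies the same keep-with-probability-n/i rule to a whole vector at once.

```r
# Sketch: reservoir sampling over a vector, using the same rule as select().
reservoir_sample <- function(stream, n) {
    keep <- stream[seq_len(n)]              # first n elements are always kept
    for (i in seq(n + 1L, length(stream))) {
        if (runif(1) < n / i)               # keep the i-th element with prob n/i
            keep[sample(n, 1)] <- stream[i] # ...replacing a random current entry
    }
    keep
}

set.seed(1)
# Sample 10 of 100 items, 5000 times; each item should be kept ~10% of the time.
counts <- table(replicate(5000, reservoir_sample(1:100, 10)))
range(counts / 5000)
```

All 100 empirical inclusion frequencies come out close to 0.1, which is what uniform sampling requires.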

We use this as part of a product handler that pre-allocates space for the results we are interested in. Each time a "product" node is encountered, we check whether it should be selected and, if so, store its value in the appropriate slot of our current results:

 sku <- character(n)
 product <- function(p) {
     i <- select()
     if (i)
         sku[[i]] <<- xmlValue(p[["sku"]])
     NULL
 }

The select and product handlers are combined with a get function that lets us retrieve the current values, and all of them are wrapped in a closure, giving a factory-style template that encapsulates the variables n, i, and sku:

 sampler <- function(n) {
     force(n)  # otherwise lazy evaluation could lead to surprises
     i <- 0L
     select <- function() {
         i <<- i + 1L
         if (i <= n) {
             i
         } else {
             if (runif(1) < n / i) sample(n, 1) else 0
         }
     }
     sku <- character(n)
     product <- function(p) {
         i <- select()
         if (i)
             sku[[i]] <<- xmlValue(p[["sku"]])
         NULL
     }
     list(product=product, get=function() list(sku=sku))
 }

And then we are ready to go

 library(XML)
 products <- xmlTreeParse("foo.xml", handlers=sampler(1000))
 as.data.frame(products$get())

Once the number of processed nodes i becomes large relative to n, replacements are rare, so memory use stays fixed at n entries while running time scales linearly with file size. That means you can gauge performance reliably by starting with subsets of the original file.
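The answer's sampler captures only the sku field; the same pattern extends to any number of fields by pre-allocating one vector per field. The sketch below is not from the original post: `sampler2` is an illustrative name, and it assumes the XML package with an inline document in place of "foo.xml".

```r
# Sketch: a sampler that keeps both <sku> and <productId> per sampled product.
library(XML)

sampler2 <- function(n) {
    force(n)
    i <- 0L
    select <- function() {
        i <<- i + 1L
        if (i <= n) i
        else if (runif(1) < n / i) sample(n, 1)
        else 0
    }
    sku <- character(n)
    productId <- character(n)
    product <- function(p) {
        j <- select()
        if (j) {
            # store both fields in the same reservoir slot
            sku[[j]]       <<- xmlValue(p[["sku"]])
            productId[[j]] <<- xmlValue(p[["productId"]])
        }
        NULL
    }
    list(product = product,
         get = function() list(sku = sku, productId = productId))
}

xml <- '<products>
  <product><sku>1</sku><productId>a</productId></product>
  <product><sku>2</sku><productId>b</productId></product>
  <product><sku>3</sku><productId>c</productId></product>
</products>'

h <- sampler2(2)
invisible(xmlTreeParse(xml, asText = TRUE, handlers = h))
df <- as.data.frame(h$get(), stringsAsFactors = FALSE)
```

Because both vectors are indexed by the same reservoir slot, each row of the resulting data frame keeps its fields paired correctly even after replacements.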


Here is an example based on the XML file that you provided.

 library(XML)

 xml <- '<?xml version="1.0" encoding="UTF-8"?>
 <products>
   <product>
     <sku>967190</sku>
     <productId>98611</productId>
     <listingId/>
     <sellerId/>
     <shippingRestrictions/>
   </product>
   <product>
     <sku>967191</sku>
     <productId>98612</productId>
     <listingId/>
     <sellerId/>
     <shippingRestrictions/>
   </product>
   <product>
     <sku>967192</sku>
     <productId>98613</productId>
     <listingId/>
     <sellerId/>
     <shippingRestrictions/>
   </product>
 </products>'

 # parse
 p <- xmlParse(xml)
 # get nodes
 nodes <- xpathApply(p, '//product')
 # return a random sample of nodes
 nodes[sample(seq_along(nodes), 2)]

Here is the result:

 > nodes[sample(seq_along(nodes), 2)]
 [[1]]
 <product>
   <sku>967191</sku>
   <productId>98612</productId>
   <listingId/>
   <sellerId/>
   <shippingRestrictions/>
 </product>

 [[2]]
 <product>
   <sku>967190</sku>
   <productId>98611</productId>
   <listingId/>
   <sellerId/>
   <shippingRestrictions/>
 </product>
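To go from the sampled nodes to a data frame, the XML package's xmlToDataFrame() accepts a list of nodes directly via its `nodes` argument. A minimal sketch (not part of the original answer), using a shortened version of the same document:

```r
# Sketch: turn a random sample of <product> nodes into a data.frame.
library(XML)

xml <- '<products>
  <product><sku>967190</sku><productId>98611</productId></product>
  <product><sku>967191</sku><productId>98612</productId></product>
  <product><sku>967192</sku><productId>98613</productId></product>
</products>'

p <- xmlParse(xml)
nodes <- xpathApply(p, '//product')
sampled <- nodes[sample(seq_along(nodes), 2)]

# Each child element (<sku>, <productId>) becomes a column.
df <- xmlToDataFrame(nodes = sampled, stringsAsFactors = FALSE)
```

Note that this approach still parses the whole file into memory first, so it suits moderately sized files; the reservoir-sampling handler in the other answer is the better fit for truly large ones.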
