Instead of reading the entire file, you can use event parsing with handlers held in a closure, which process only the nodes of interest to you. To get there, I'll start with a random sampling strategy (reservoir sampling) that works one record at a time: if the index i of the current record is less than or equal to the number n of records to keep, save it unconditionally; otherwise save it with probability n / i, replacing a randomly chosen record already in the sample. It can be implemented as
    i <- 0L; n <- 10L
    select <- function() {
        i <<- i + 1L
        if (i <= n)
            i
        else if (runif(1) < n / i)
            sample(n, 1)
        else
            0
    }
which behaves as follows:
    > i <- 0L; n <- 10L; replicate(20, select())
     [1]  1  2  3  4  5  6  7  8  9 10  1  5  7  0  1  9  0  2  1  0
This says: the first 10 records are saved; then record 11 replaces saved record 1, record 12 replaces saved record 5, record 13 replaces saved record 7, record 14 is discarded, and so on. Replacements become less frequent as i grows much larger than n.
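The defining property of this scheme is that every record ends up in the sample with the same probability n / N. A quick simulation can confirm this; the reservoir() helper below is a self-contained restatement of select written just for this check, not part of the code above:

    ## Self-contained reservoir sampler, for simulation only
    reservoir <- function(N, n = 10L) {
        keep <- integer(n)
        for (i in seq_len(N)) {
            if (i <= n)
                keep[i] <- i
            else if (runif(1) < n / i)
                keep[sample(n, 1)] <- i
        }
        keep
    }
    set.seed(123)
    sims <- replicate(10000, reservoir(100L))
    mean(colSums(sims == 1))      # P(record 1 kept),   expect ~ n / N = 0.1
    mean(colSums(sims == 100))    # P(record 100 kept), expect ~ 0.1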
We use this as part of a product handler, which pre-allocates space for the results we are interested in; each time a 'product' node is encountered, it checks whether the node should be selected and, if so, adds its value to the current results in the appropriate slot:
    sku <- character(n)
    product <- function(p) {
        i <- select()
        if (i)
            sku[[i]] <<- xmlValue(p[["sku"]])
        NULL
    }
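For concreteness, the handler assumes each product node carries a sku child whose text xmlValue() extracts. The original doesn't show the input, but a file along these lines would match (hypothetical layout, created here just for experimenting; a real file would have many more records):

    ## Hypothetical "foo.xml" matching what the handler expects
    writeLines(c(
        "<products>",
        "  <product><sku>ABC-1</sku></product>",
        "  <product><sku>ABC-2</sku></product>",
        "  <product><sku>ABC-3</sku></product>",
        "</products>"
    ), "foo.xml")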
The "select" and "product" handlers are combined with a ( get ) function that allows us to retrieve the current values, and all of them are placed in closure, so that we have a factory view template that encapsulates the variables n , i and sku
    sampler <- function(n) {
        force(n)    # otherwise lazy evaluation could lead to surprises
        i <- 0L
        select <- function() {
            i <<- i + 1L
            if (i <= n)
                i
            else if (runif(1) < n / i)
                sample(n, 1)
            else
                0
        }
        sku <- character(n)
        product <- function(p) {
            i <- select()
            if (i)
                sku[[i]] <<- xmlValue(p[["sku"]])
            NULL
        }
        list(product = product, get = function() list(sku = sku))
    }
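That force(n) deserves a word: R evaluates function arguments lazily, so n remains an unevaluated promise until something in the body first touches it, and in the meantime a change to the variable passed in can leak into the closure. A minimal illustration of the general pitfall (hypothetical names, not part of the code above):

    make <- function(n) function() n      # note: no force(n)
    sz <- 5
    f <- make(sz)
    sz <- 500
    f()                                   # returns 500, not 5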
And then we are ready to go:

    library(XML)
    products <- sampler(1000)
    xmlEventParse("foo.xml", branches = products["product"])
    as.data.frame(products$get())

The branches argument of xmlEventParse hands each complete product node to our handler as the file streams past, so the document never has to be held in memory all at once.
Once the number of processed nodes i is large relative to n, replacements become rare and the cost per node is essentially constant, so run time scales linearly with file size; that means you can get a good sense of how well this will work by timing it on subsets of the original file.
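One way to build such subsets, assuming the file keeps the opening products tag on the first line and one complete product record per line thereafter (an assumption about the file, not something stated above), is to close off a prefix and time the parse:

    ## Sketch: time the parser on a k-record prefix of the file.
    ## Assumes line 1 is the opening <products> tag and each of the
    ## next k lines holds one complete <product> record.
    subset_xml <- function(src, dest, k) {
        top <- readLines(src, n = k + 1L)
        writeLines(c(top, "</products>"), dest)
    }
    subset_xml("foo.xml", "foo_10k.xml", 10000L)
    s <- sampler(1000)
    system.time(xmlEventParse("foo_10k.xml", branches = s["product"]))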