I currently have ~20,000 XML files ranging in size from a few kilobytes to several megabytes. Although this may not be ideal, I use the xmlTreeParse function in the XML package to loop through each of the files, extract the text I need, and save it as a CSV file.
The code below works great for files under 1 MB in size:
    library(XML)

    files <- list.files()
    for (i in files) {
      doc  <- xmlTreeParse(i, useInternalNodes = TRUE)
      root <- xmlRoot(doc)
      name <- xmlValue(root[[8]][[1]][[1]])  # Name
      data <- xmlValue(root[[8]][[1]])       # Full text
      x <- data.frame(name = name)
      x$data <- data
      write.csv(x, paste0(i, ".csv"), row.names = FALSE, na = "")
    }
The problem is that any file > 1 MB gives me the following error:
    Excessive depth in document: 256 use XML_PARSE_HUGE option
    Extra content at the end of the document
    Error: 1: Excessive depth in document: 256 use XML_PARSE_HUGE option
    2: Extra content at the end of the document
Please forgive my ignorance, but I tried to find the XML_PARSE_HUGE option in the XML package and could not locate it. Has anyone had experience using this option? If so, I would really appreciate any advice on how to get this code to process these larger XML files.
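From what I can tell so far, XML_PARSE_HUGE is a libxml2 parser flag rather than a standalone R function, and the XML package seems to pass such flags through the options argument of xmlTreeParse. Below is a minimal sketch of what I think the call would look like; the HUGE constant, and the assumption that it lifts the depth-256 limit from the error above, come from my reading of the docs and are not something I have verified (the filename is just a placeholder):

    library(XML)

    # Sketch: pass libxml2's XML_PARSE_HUGE flag via the options argument.
    # HUGE is assumed to be the constant the XML package exposes for this
    # flag; it is meant to relax libxml2's hard-coded parser limits, such
    # as the "Excessive depth in document: 256" limit seen above.
    doc  <- xmlTreeParse("large-file.xml", useInternalNodes = TRUE, options = HUGE)
    root <- xmlRoot(doc)

If that is the right mechanism, presumably the same options = HUGE argument could be dropped into the xmlTreeParse call inside my loop above.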
Thanks!