Parse XML files (> 1 megabyte) in R

Currently, I have ~20,000 XML files ranging in size from a few kilobytes to several megabytes. Although this may not be ideal, I use the xmlTreeParse function from the XML package to loop through each file, extract the text I need, and save the result as a CSV file.

The code below is great for files <1 MB in size:

    library(XML)

    files <- list.files()
    for (i in files) {
      doc <- xmlTreeParse(i, useInternalNodes = TRUE)
      root <- xmlRoot(doc)
      name <- xmlValue(root[[8]][[1]][[1]])  # Name
      data <- xmlValue(root[[8]][[1]])       # Full text
      x <- data.frame(name)
      x$data <- data
      # paste0 avoids the stray space that paste(i, ".csv") would put in the filename
      write.csv(x, paste0(i, ".csv"), row.names = FALSE, na = "")
    }
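As an aside, the same extraction can be written with XPath instead of positional indexing like root[[8]][[1]][[1]], which tends to be less brittle when the structure varies between files. This is only a sketch: //record/name is a placeholder path, not the real schema of these files.

    library(XML)

    doc <- xmlTreeParse("example.xml", useInternalNodes = TRUE)
    # "//record/name" is a placeholder XPath; substitute the real element names
    name <- xpathSApply(doc, "//record/name", xmlValue)
    free(doc)  # release the C-level document created by useInternalNodes = TRUE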

The problem is that any file > 1 MB gives me the following error:

    Excessive depth in document: 256 use XML_PARSE_HUGE option
    Extra content at the end of the document
    Error: 1: Excessive depth in document: 256 use XML_PARSE_HUGE option
    2: Extra content at the end of the document

Please forgive my ignorance, but I tried to find the XML_PARSE_HUGE option in the XML package and could not find it. Has anyone had experience using this option? If so, I would really appreciate any advice on how to get this code to process several large XML files.

Thanks!

1 answer

To select "XML_PARSE_HUGE", you need to specify it in the parameters. XML:::parserOptions lists the options:

    > XML:::parserOptions
       RECOVER      NOENT    DTDLOAD    DTDATTR   DTDVALID    NOERROR  NOWARNING
             1          2          4          8         16         32         64
      PEDANTIC   NOBLANKS       SAX1   XINCLUDE      NONET     NODICT    NSCLEAN
           128        256        512       1024       2048       4096       8192
       NOCDATA NOXINCNODE    COMPACT      OLD10  NOBASEFIX       HUGE     OLDSAX
         16384      32768      65536     131072     262144     524288    1048576

e.g.

    > HUGE
    [1] 524288

It is enough to pass an integer vector containing any of these options. In your case:

    xmlTreeParse(i, useInternalNodes = TRUE, options = HUGE)
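For completeness, here is a sketch of the loop from the question with the HUGE option added. The positional indexing and output naming are carried over unchanged from the question; only the options argument is new.

    library(XML)

    files <- list.files()
    for (i in files) {
      # HUGE relaxes libxml2's hard-coded parser limits, such as the
      # maximum nesting depth of 256 reported in the error above
      doc <- xmlTreeParse(i, useInternalNodes = TRUE, options = HUGE)
      root <- xmlRoot(doc)
      name <- xmlValue(root[[8]][[1]][[1]])  # Name
      data <- xmlValue(root[[8]][[1]])       # Full text
      x <- data.frame(name)
      x$data <- data
      write.csv(x, paste0(i, ".csv"), row.names = FALSE, na = "")
    }

Since options takes an integer vector, several options can be combined, e.g. options = c(HUGE, NOBLANKS) to also drop whitespace-only text nodes.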