Excessive depth in document: XML_PARSE_HUGE option for xml2::read_html() in R

First, I would like to apologize for asking a new question, as my profile does not yet allow me to comment on other people's answers, especially on the two SO posts that I saw. So please bear with this older guy :-)

I am trying to read a list of about 100 character files ranging in size from 90 KB to 2 MB, and then use the qdap package to compute some statistics on the text I extract from them, namely counting sentences, words, etc. The files were obtained with RSelenium::remoteDriver$getPageSource() , cleaned beforehand, and saved to file with write(pgSource, fileName.txt) . I read the files in a loop using:

 pgSource <- readChar(file.path(fPath, fileNames[i]), nchars = 1e6)
 doc <- read_html(pgSource)

which for some files throws

 Error in eval(substitute(expr), envir, enclos) :
   Excessive depth in document: 256 use XML_PARSE_HUGE option
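For completeness, the reading loop can be written so that a single failing file does not abort the whole run. This is only a sketch: read_saved_source is a hypothetical helper name, and the temporary file merely stands in for one of the saved page sources described above. Note that taking nchars from file.size() also avoids truncating files larger than 1e6 characters.

```r
library(xml2)

# Hypothetical helper: read a saved page source in full and return NULL
# (instead of stopping) when the parse fails.
read_saved_source <- function(path) {
  pgSource <- readChar(path, nchars = file.size(path))
  tryCatch(
    read_html(pgSource),
    error = function(e) {
      message("Parse failed for ", basename(path), ": ", conditionMessage(e))
      NULL
    }
  )
}

# Tiny self-contained demo with a temporary file standing in for a saved page:
tmp <- tempfile(fileext = ".txt")
write("<html><body><p>hello</p></body></html>", tmp)
doc <- read_saved_source(tmp)
```

With a vector of paths, `docs <- lapply(paths, read_saved_source)` then yields NULL entries only for the files that failed to parse.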

I saw these posts, SO33819103 and SO31419409 , which point to similar issues, but I cannot fully understand how to use @shabbychef's workaround as suggested in both posts, using the snippet suggested by @glossarch in the first link above.

 library(drat)
 drat:::add("shabbychef")
 install.packages('xml2')
 library("xml2")

EDIT: I noticed that when I ran another script earlier, one that scraped real-time data from web pages by URL, I did not encounter this problem. The code was the same; I simply called doc <- read_html(pgSource) right after getting the source from the RSelenium remoteDriver .

What I would like to ask this gentle community is whether I am following the correct steps to install and load xml2 after adding the shabbychef drat repository, or whether I need to add some other step as suggested in SO17154308 . Any help or suggestions are welcome. Thanks.

1 answer

I don't know if this is the right thing to do, but @hrbrmstr answered my question in one of his comments. I decided to publish the answer so that people stumbling over this question would see that it has at least one answer.

The problem was solved mainly by using the "HUGE" option when reading the HTML source. My problem occurred only when loading a previously saved source; I did not hit the same issue when using the "live" version of the application, i.e. reading the source directly from the website.

In any case, an August 2016 update to the excellent xml2 package now allows you to use the HUGE option as follows:

 doc <- read_html(pageSource, options = "HUGE") 
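A minimal sketch, assuming xml2 is installed, that reproduces the depth limit with an artificially nested document and shows the option lifting it. The 300-deep nesting is an arbitrary value chosen to exceed libxml2's default limit of 256:

```r
library(xml2)

# Build an HTML string nested deeper than libxml2's default depth limit of 256.
deep <- paste0(strrep("<div>", 300), "deep text", strrep("</div>", 300))

# Without the option this parse can fail with the "Excessive depth" error;
# with options = "HUGE", libxml2 relaxes its hard-coded parser limits.
doc <- read_html(deep, options = "HUGE")
```

The same `options = "HUGE"` argument applies whether the source comes from a string, a file, or a connection.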

For more information, please read the xml2 reference manual on CRAN: CRAN-xml2

I want to thank @hrbrmstr again for his valuable contribution.
