I have a list of URLs. For each one I extract the web content and store it in a tm corpus:
    library(tm)
    library(XML)

    link <- c(
      "http://www.r-statistics.com/tag/hadley-wickham/",
      "http://had.co.nz/",
      "http://vita.had.co.nz/articles.html",
      "http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html",
      "http://www.analyticstory.com/hadley-wickham/"
    )

    # Build one corpus per URL from the text of its <p> elements
    create.corpus <- function(url.name) {
      doc <- htmlParse(url.name)
      parag <- xpathSApply(doc, "//p", xmlValue)
      if (length(parag) == 0) {
        parag <- "empty"  # placeholder so the corpus is never empty
      }
      cc <- Corpus(VectorSource(parag))
      meta(cc, "link") <- url.name  # record the source URL as metadata
      return(cc)
    }

    # In my real data the URLs come from a data frame instead: link <- catch$url
    cc <- lapply(link, create.corpus)
This gives me a "large list" of corpora, one for each URL. Combining them one at a time works:
    x <- cc[[1]]
    y <- cc[[2]]
    z <- c(x, y, recursive = TRUE)
But this becomes impractical for a list with several thousand corpora. So how can I combine the list of corpora into a single corpus while preserving the metadata?
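For illustration, this is the kind of one-step call I am after. It is only a sketch: `combine_all` is a hypothetical helper name, and it assumes that the same `c(..., recursive = TRUE)` call that works for two corpora also accepts the whole list at once. Whether the "link" metadata survives the merge is exactly the open question.

```r
# Hypothetical helper: combine a plain list of objects with a single c() call.
# Assumption: c(x, y, recursive = TRUE) also accepts more than two arguments.
combine_all <- function(objs) {
  # Append recursive = TRUE as a named argument, then splice everything into c()
  do.call(c, c(objs, list(recursive = TRUE)))
}

# Intended use with the list built above (not run here, needs tm and network access):
# big <- combine_all(cc)
# meta(big)   # inspect whether the "link" metadata is still there
```

The helper itself relies only on base R (`do.call` and `c`), so it behaves like an ordinary concatenation for plain vectors; how it behaves for corpora depends on the `c()` method provided by the installed tm version.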