tm: combining a list of corpora

I have a list of URLs whose web content I scraped and loaded into tm corpora:

    library(tm)
    library(XML)

    link <- c(
      "http://www.r-statistics.com/tag/hadley-wickham/",
      "http://had.co.nz/",
      "http://vita.had.co.nz/articles.html",
      "http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html",
      "http://www.analyticstory.com/hadley-wickham/"
    )

    create.corpus <- function(url.name) {
      doc <- htmlParse(url.name)
      parag <- xpathSApply(doc, '//p', xmlValue)
      if (length(parag) == 0) {
        parag <- "empty"
      }
      cc <- Corpus(VectorSource(parag))
      meta(cc, "link") <- url.name
      return(cc)
    }

    link <- catch$url
    cc <- lapply(link, create.corpus)

This gives me a "large list" of corpora, one for each URL. Combining them one at a time works:

    x <- cc[[1]]
    y <- cc[[2]]
    z <- c(x, y, recursive = TRUE)  # preserved metadata
    x; y; z
    # A corpus with 8 text documents
    # A corpus with 2 text documents
    # A corpus with 10 text documents

But this becomes impractical for a list of several thousand corpora. So how can I combine the whole list into a single corpus while preserving the metadata?

3 answers

You can use do.call to call c:

    do.call(function(...) c(..., recursive = TRUE), cc)
    # A corpus with 155 text documents
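To try this without fetching any pages, here is a minimal sketch that builds two small corpora from in-memory text and combines them the same way. The helper make_corpus, the sample texts, and the example.com links are made up for illustration, and VCorpus is used in place of Corpus as an assumption so that the corpus class is explicit. Whether recursive = TRUE is still needed to keep the metadata depends on the tm version.

```r
library(tm)

# Hypothetical helper: build a corpus from a character vector and tag
# every document in it with the (made-up) link it came from.
make_corpus <- function(texts, url) {
  corp <- VCorpus(VectorSource(texts))
  meta(corp, "link") <- url
  corp
}

# Stand-in for the list returned by lapply(link, create.corpus).
cc <- list(
  make_corpus(c("first paragraph", "second paragraph"), "http://example.com/a"),
  make_corpus("only paragraph", "http://example.com/b")
)

# do.call spreads the list elements out as separate arguments to c().
combined <- do.call(function(...) c(..., recursive = TRUE), cc)

length(combined)  # number of documents in the combined corpus
meta(combined)    # document-level metadata, including the link column
```

Because do.call passes every list element as its own argument, the same line should work unchanged for a list of any length.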

I don't think tm offers a built-in function to merge many corpora. But a corpus is a list of documents, so the question is really how to flatten a list of lists into a single list. I would create a new corpus from all the documents and then assign the metadata manually:

    y <- Corpus(VectorSource(unlist(cc)))
    meta(y, 'link') <- do.call(rbind, lapply(cc, meta))$link
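The flattening idea itself can be shown in base R, independent of tm. The sketch below uses plain nested lists as a stand-in for corpora (the names nested, links, flat and flat_links are hypothetical) and shows how one link per group can be repeated so that it lines up with the flattened documents.

```r
# Stand-in data: two groups of "documents" and one link per group.
nested <- list(list("doc a1", "doc a2"), list("doc b1"))
links  <- c("http://example.com/a", "http://example.com/b")

# Flatten one level only: a list of lists becomes a single list.
flat <- unlist(nested, recursive = FALSE)

# Repeat each link once per document in its group so that the
# metadata vector stays aligned with the flattened list.
flat_links <- rep(links, times = lengths(nested))

length(flat)  # 3
flat_links    # "http://example.com/a" "http://example.com/a" "http://example.com/b"
```

The same alignment concern applies to the corpus version: the metadata has to be expanded to one entry per document, in the same order as the combined documents.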

Your code does not run because catch is not defined, so I don't know exactly what it is supposed to do.

But tm corpora can now simply be combined with c() to create one big corpus: https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/tm_combine

So maybe c(unlist(cc)) will work. I can't verify this, because your code does not run.
