How can I group thousands of documents using the R tm package?

I have about 25,000 documents that need to be clustered, and I was hoping to use the R tm package. Unfortunately, I run out of memory at around 20,000 documents. The function below shows what I'm trying to do with dummy data; it exhausts memory when called with n = 20 (i.e., 20,000 documents) on a machine with 16 GB of RAM. Are there any optimizations I could make?

Thanks for any help.

make_clusters <- function(n) {
    require(tm)
    require(slam)
    # n groups of 1,000 identical one-letter documents, n * 1000 docs in total
    docs <- unlist(lapply(letters[1:n], function(x) rep(x, 1000)))
    # sparse terms-x-documents TF-IDF matrix
    tdf <- TermDocumentMatrix(Corpus(VectorSource(docs)),
                              control = list(weighting = weightTfIdf,
                                             wordLengths = c(1, Inf)))
    # cosine similarity between documents: (t(A) %*% A) / (||a_i|| * ||a_j||)
    tdf.norm <- col_norms(tdf)
    docs.simil <- crossprod_simple_triplet_matrix(tdf, tdf) / outer(tdf.norm, tdf.norm)
    # hierarchical clustering on cosine distance
    hh <- hclust(as.dist(1 - docs.simil))
}
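For context on where the memory goes: `crossprod_simple_triplet_matrix` returns an ordinary dense matrix, and `outer()`, the division, `1 - docs.simil`, and `as.dist()` each materialize further document-by-document objects of comparable size. A rough back-of-envelope estimate (a sketch of the arithmetic only, not tied to any tm internals):

```r
# Approximate memory for one dense n_docs x n_docs matrix of doubles
# (8 bytes per entry), ignoring R object overhead.
simil_matrix_gb <- function(n_docs) {
    n_docs^2 * 8 / 1024^3
}

simil_matrix_gb(20000)   # ~3 GB for a single copy
# The pipeline above holds several such intermediates at once
# (crossprod result, outer() result, their quotient, 1 - simil, the dist),
# which can plausibly exceed 16 GB at 20,000 documents.
```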