Trying to remove words from DocumentTermMatrix to use topicmodels

So, I'm trying to use the topicmodels package for R (100 topics on a body of ~6,400 documents, each about 1,000 words). The process starts and then dies, I think because it runs out of memory.

So I am trying to reduce the size of the document-term matrix, which is passed as input to the LDA() function; I assumed I could do this with the minDocFreq control option when generating the matrix. But when I use it, it does not seem to make any difference.

Here is the corresponding bit of code:

    > corpus <- Corpus(DirSource('./chunks/'), fileEncoding='utf-8')
    > dtm <- DocumentTermMatrix(corpus)
    > dim(dtm)
    [1]  6423 41613
    # So, I assume this next command will make my document-term matrix smaller, i.e.
    # fewer columns. I've chosen a large number, 100, to illustrate the point.
    > smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))
    > dim(smaller)
    [1]  6423 41613

Same dimensions, and in particular the same number of columns (i.e. the same number of terms).

Can anyone see what I'm doing wrong? Thanks.
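As a sanity check, the per-term document frequencies can be counted directly (a sketch using the slam package, on which tm's sparse matrices are built; shown here on tm's bundled crude corpus rather than my own):

```r
library(tm)
library(slam)   # tm's DocumentTermMatrix is a slam simple_triplet_matrix

data("crude")   # small example corpus bundled with tm
dtm <- DocumentTermMatrix(crude)

# Document frequency of each term: in how many documents does it occur?
doc_freq <- col_sums(dtm > 0)
summary(doc_freq)
# If a minimum-document-frequency filter had actually been applied,
# min(doc_freq) would be at least the threshold.
```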

1 answer

The answer to your question is here: fooobar.com/questions/785827/... (upvote it!)

In short, more recent versions of the tm package no longer support minDocFreq; they use bounds instead. For example, your

 smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100)) 

should now be

    > require(tm)
    > data("crude")
    > smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(5, Inf))))
    > dim(smaller)  # terms that appear in fewer than 5 documents are discarded
    [1] 20 67
    > smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(10, Inf))))
    > dim(smaller)  # terms that appear in fewer than 10 documents are discarded
    [1] 20 17
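If rebuilding the matrix is inconvenient, tm's removeSparseTerms() is another way to shrink an existing DocumentTermMatrix before handing it to LDA(). A sketch on the bundled crude corpus (the 0.90 sparsity threshold is purely illustrative):

```r
library(tm)

data("crude")
dtm <- DocumentTermMatrix(crude)
dim(dtm)

# Drop terms that are absent from more than 90% of documents,
# i.e. keep only terms appearing in at least ~10% of them
smaller <- removeSparseTerms(dtm, sparse = 0.90)
dim(smaller)
```

Lowering the sparse argument discards more terms, so you can tune it until the matrix fits in memory.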
