Text processing with the tm R package for a large corpus - cannot find common words

So, I am analyzing a huge corpus. It contains about 40,000 documents. I am trying to parse it using the R tm package. I created a document-term matrix that reports 100% sparsity, which would mean there are no common words in this corpus.

library(qdap)
library(SnowballC)
library(dplyr)
library(tm)

# Build the corpus from a directory of plain-text files
docs <- Corpus(DirSource(cname))

# Standard cleanup: lowercase, drop numbers and punctuation, remove stop words
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords("english"))
#docs <- tm_map(docs, stemDocument)
dtm <- DocumentTermMatrix(docs)

<<DocumentTermMatrix (documents: 39373, terms: 108065)>>
Non-/sparse entries: 2981619/4251861626
Sparsity           : 100%
Maximal term length: 93
Weighting          : term frequency (tf)
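
For what it's worth, the sparsity figure is just the share of zero cells in the matrix, rounded by tm's print method, so it works out from the counts above roughly like this (a back-of-the-envelope check, not output from my session):

n_docs    <- 39373
n_terms   <- 108065
n_nonzero <- 2981619

total_cells <- n_docs * n_terms     # 4,254,843,245 cells in total
1 - n_nonzero / total_cells         # ~0.9993, which tm prints as 100%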

I removed all infrequent words and got this:

dtms <- removeSparseTerms(dtm, 0.1)
dim(dtms)
[1] 39373     0
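
For reference, the documentation describes the second argument as the maximum allowed sparsity, so (if I read it right) 0.1 keeps only terms that appear in at least 90% of the roughly 39,000 documents. A sketch of what I mean (0.99 is just an example cutoff I have not run on this corpus):

# removeSparseTerms(dtm, s) keeps a term only if it appears in at least
# a fraction (1 - s) of the documents. With s = 0.1 a term must occur in
# >= 90% of ~39,000 documents, which almost no term does.
dtms_strict  <- removeSparseTerms(dtm, 0.1)    # terms in >= 90% of docs
dim(dtms_strict)                               # [1] 39373 0, as above

# A much more lenient cutoff keeps terms occurring in at least 1% of docs
dtms_lenient <- removeSparseTerms(dtm, 0.99)
dim(dtms_lenient)                              # should keep far more terms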

Is R behaving this way because my corpus is too big?

Update

So, I think I understand this problem a bit better now. It seems to be a parallel-computing problem, although I'm not quite sure. But I came across this material that talks about distributed text mining in R: Link

