So, I am analyzing a huge case that contains about 40,000 documents, and I am trying to parse it with the R tm package. I created a document-term matrix that reports 100% sparsity, which (as I understand it) means there are no common words in this corpus.
library(qdap)
library(SnowballC)
library(dplyr)
library(tm)

# cname is the directory containing the ~40,000 documents
docs <- Corpus(DirSource(cname))

# basic cleaning: lower-case, strip numbers and punctuation, remove English stopwords
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(docs)
<<DocumentTermMatrix (documents: 39373, terms: 108065)>>
Non-/sparse entries: 2981619/4251861626
Sparsity : 100%
Maximal term length: 93
Weighting : term frequency (tf)
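For what it's worth, the 100% figure seems to be rounded: sparsity is the share of empty cells in the matrix, and with the counts printed above it works out to roughly 99.93%.

# Sparsity = proportion of zero cells, using the counts shown above
non_sparse  <- 2981619               # non-zero entries
total_cells <- 39373 * 108065        # documents x terms = 4254843245
1 - non_sparse / total_cells         # ~0.9993, which tm prints as "100%"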
I then removed the infrequent (sparse) terms and got this:
dtms <- removeSparseTerms(dtm, 0.1)
dim(dtms)
[1] 39373 0
Is R behaving this way because my corpus is too big?
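From the removeSparseTerms() documentation, the sparse argument appears to be the maximum proportion of documents a term may be absent from, so sparse = 0.1 keeps only terms present in at least 90% of the ~39,000 documents, and very few terms will survive that. A quick sketch comparing looser thresholds (assuming the dtm built above):

# sparse = 0.1 keeps only terms appearing in >= 90% of documents;
# looser thresholds keep progressively more terms
for (s in c(0.90, 0.99, 0.999)) {
  dtms <- removeSparseTerms(dtm, s)
  cat("sparse =", s, "-> terms kept:", ncol(dtms), "\n")
}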
Update
So, I think I mostly understand this problem now. It seems to be a parallel computing problem, though I'm not quite sure. I came across these discussions about distributed text mining in R: Link
Additional update

I also looked for similar problems on Kaggle and Stack Overflow, and at other discussions of the tm package in R.