TermDocumentMatrix sometimes throws an error

I am creating a word cloud based on tweets from various sports teams. This code runs successfully only about 1 time in 10:

 handle <- 'arsenal'
 txt <- searchTwitter(handle,n=1000,lang='en')
 t <- sapply(txt,function(x) x$getText())
 t <- gsub('http.*\\s*|RT|Retweet','',t)
 t <- gsub(handle,'',t)
 t_c <- Corpus(VectorSource(t))
 tdm = TermDocumentMatrix(t_c,control = list(removePunctuation = TRUE,stopwords = stopwords("english"),removeNumbers = TRUE, content_transformer(tolower)))
 m = as.matrix(tdm)
 word_freqs = sort(rowSums(m), decreasing=TRUE)
 dm = data.frame(word=names(word_freqs), freq=word_freqs)
 wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"),rot.per=0.5)

The other 9 times out of 10, it throws the following error:

 Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
   'i, j, v' different lengths
 In addition: Warning messages:
 1: In mclapply(unname(content(x)), termFreq, control) :
   all scheduled cores encountered errors in user code
 2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
   NAs introduced by coercion

Any ideas? I searched online but did not find enough to solve this. Keep in mind that I am completely new to R!

r term-document-matrix word-cloud
3 answers

After some experimenting, the following line of code completely fixed my problem:

 t <- iconv(t,to="utf-8-mac") 
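The answer does not say why this works, and the encoding name "utf-8-mac" is, as far as I know, only recognised by the iconv build on macOS. A minimal base R sketch of the same idea follows, assuming the failures come from tweets containing bytes that are not valid text in the session's encoding. The helper names `find_invalid_utf8` and `sanitize_text` and the sample tweets are hypothetical, not part of the original answer.

```r
# Hypothetical helpers, not part of the original answer.

# Return the indices of elements that are not valid UTF-8.
find_invalid_utf8 <- function(x) which(!validUTF8(x))

# Convert to plain ASCII. sub = "" drops every byte that cannot be
# converted instead of turning the whole string into NA.
# Note: this is lossy, accented letters and emoji are removed too.
sanitize_text <- function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# Made-up sample data standing in for the downloaded tweets.
tweets <- c("plain text",
            "caf\u00e9 before the match",
            "broken byte \xff here")

bad   <- find_invalid_utf8(tweets)  # which tweets contain invalid bytes
clean <- sanitize_text(tweets)      # safe to pass on to Corpus()
```

In the question's code this would mean calling something like `t <- sanitize_text(t)` just before `Corpus(VectorSource(t))`. Dropping all non-ASCII characters is a blunt choice, but it should behave the same on every platform.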

Suppose you used the following line of code somewhere before using the DocumentTermMatrix command.

 corpus = tm_map(corpus, PlainTextDocument) 

This line of code converts all text in corpus to PlainTextDocument, on which the DocumentTermMatrix function does not work correctly.

Just rebuild the corpus from scratch and repeat the pre-processing without the above command, and it should work.
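A minimal sketch of such a rebuild, assuming the `tm` package is installed. The sample texts and the exact choice of pre-processing steps are illustrative assumptions, not taken from the question. Wrapping `tolower` in `content_transformer()` keeps each document's type intact, so no `PlainTextDocument` conversion is needed.

```r
library(tm)

# Made-up sample data standing in for the cleaned tweets.
texts <- c("Arsenal win 2-0 at home!", "Great goal by the striker")

# Build the corpus once, then apply each transformation through tm_map().
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# No tm_map(corpus, PlainTextDocument) step before building the matrix.
dtm <- DocumentTermMatrix(corpus)
```

The same corpus can be passed to `TermDocumentMatrix(corpus)` if terms are wanted as rows, as in the question.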


If you remove:

 corpus = tm_map(corpus, PlainTextDocument) 

you also need to remove:

 t_c <- Corpus(VectorSource(t)) 

You will then get the correct output for TermDocumentMatrix.
