I am using the tm and lda packages in R to topic-model a corpus of news articles. However, I keep getting an odd problem: an empty token "" that messes up my topics. Here is my workflow:
    text <- Corpus(VectorSource(d$text))
    newtext <- lapply(text, tolower)
    sw <- c(stopwords("english"), "ahram", "online", "egypt", "egypts", "egyptian")
    newtext <- lapply(newtext, function(x) removePunctuation(x))
    newtext <- lapply(newtext, function(x) removeWords(x, sw))
    newtext <- lapply(newtext, function(x) removeNumbers(x))
    newtext <- lapply(newtext, function(x) stripWhitespace(x))
    d$processed <- unlist(newtext)
    corpus <- lexicalize(d$processed)
    k <- 40
    result <- lda.collapsed.gibbs.sampler(corpus$documents, k, corpus$vocab,
                                          500, .02, .05,
                                          compute.log.likelihood = TRUE,
                                          trace = 2L)
Unfortunately, when I train the LDA model everything looks fine, except that the most frequent word is "". I tried to fix this by removing it from the vocabulary as below, and then re-running the model as above:
    newtext <- lapply(newtext, function(x) removeWords(x, ""))
But the empty token is still there, as shown by:
    > str_split(newtext[[1]], " ")
    [[1]]
     [1] ""              "body"          "mohamed"       "hassan"
     [5] "cook"          "found"         "turkish"       "search"
     [9] "rescue"        "teams"         "rescued"       "hospital"
    [13] "rescue"        "teams"         "continued"     "search"
    [17] "missing"       "body"          "cook"          "crew"
    [21] "wereegyptians" "sudanese"      "syrians"       "hassan"
    [25] "cook"          "cargo"         "ship"          "sea"
    [29] "bright"        "crashed"       "thursday"      "port"
    [33] "antalya"       "southern"      "turkey"        "vessel"
    [37] "collided"      "rocks"         "port"          "thursday"
    [41] "night"         "result"        "heavy"         "winds"
    [45] "waves"         "crew"          ""
Any suggestions on how to get rid of this empty token? Adding "" to my stop-word list does not help either.
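For context, here is a minimal sketch of what I suspect is happening (assumptions: `stripWhitespace()` in tm collapses runs of whitespace into a single space but does not trim leading or trailing spaces, so a document that starts or ends with a space produces an empty-string token when split on `" "`). Trimming each processed document before `lexicalize()` might be the fix:

    # Minimal reproduction: a leading space survives stripWhitespace-style
    # cleanup, and splitting on " " then yields an empty-string token.
    doc <- " body mohamed hassan"        # leading space, e.g. left behind by removeWords()
    tokens <- strsplit(doc, " ")[[1]]
    tokens[1]                            # "" -- the phantom token

    # Possible fix: trim leading/trailing whitespace before lexicalize()
    doc <- trimws(doc)                   # base R >= 3.2.0; or gsub("^\\s+|\\s+$", "", doc)
    tokens <- strsplit(doc, " ")[[1]]
    stopifnot(!("" %in% tokens))         # no empty token remains

In the workflow above this would mean something like `d$processed <- trimws(unlist(newtext))` before calling `lexicalize()`, but I have not verified this against my full corpus.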