Removing an "empty" character element from the document corpus in R?

I am using tm and lda in R to topic-model a news corpus. However, I run into a "non-word" problem: an empty string "" that messes up my topics. Here is my workflow:

    library(tm)
    library(lda)

    text <- Corpus(VectorSource(d$text))
    newtext <- lapply(text, tolower)
    sw <- c(stopwords("english"), "ahram", "online", "egypt", "egypts", "egyptian")
    newtext <- lapply(newtext, function(x) removePunctuation(x))
    newtext <- lapply(newtext, function(x) removeWords(x, sw))
    newtext <- lapply(newtext, function(x) removeNumbers(x))
    newtext <- lapply(newtext, function(x) stripWhitespace(x))
    d$processed <- unlist(newtext)
    corpus <- lexicalize(d$processed)
    k <- 40
    result <- lda.collapsed.gibbs.sampler(corpus$documents, k, corpus$vocab,
                                          500, .02, .05,
                                          compute.log.likelihood = TRUE, trace = 2L)

Unfortunately, when I train the LDA model, everything looks great except that the most frequent word is "". I tried to fix this by deleting it from the text as below and then re-fitting the model as above:

 newtext <- lapply(newtext, function(x) removeWords(x, "")) 

But it is still there, as evidenced by:

    str_split(newtext[[1]], " ")
    [[1]]
     [1] ""              "body"          "mohamed"       "hassan"
     [5] "cook"          "found"         "turkish"       "search"
     [9] "rescue"        "teams"         "rescued"       "hospital"
    [13] "rescue"        "teams"         "continued"     "search"
    [17] "missing"       "body"          "cook"          "crew"
    [21] "wereegyptians" "sudanese"      "syrians"       "hassan"
    [25] "cook"          "cargo"         "ship"          "sea"
    [29] "bright"        "crashed"       "thursday"      "port"
    [33] "antalya"       "southern"      "turkey"        "vessel"
    [37] "collided"      "rocks"         "port"          "thursday"
    [41] "night"         "result"        "heavy"         "winds"
    [45] "waves"         "crew"          ""
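For reference, the empty token can be reproduced in plain R without tm at all; my guess is that it comes from a leading or trailing space left in the processed string, since splitting on a single space turns every boundary space into a "":

```r
# a leading space and a doubled space both produce "" tokens on splitting
s <- " result heavy  winds"
strsplit(s, " ")[[1]]
# [1] ""       "result" "heavy"  ""       "winds"
```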

Any suggestions on how to remove it? Adding "" to my stop word list does not help either.

2 answers

I deal with text a lot, but not with tm, so here are two approaches to get rid of those "" entries. They are probably extra empty strings caused by double spaces between sentences. You can treat this condition before or after you turn the text into a bag of words: replace every run of two or more spaces with a single space before strsplit, or drop the empty strings afterwards (you need to unlist after strsplit).

    x <- "I like to ride my bicycle.  Do you like to ride too?"
    ## TREAT BEFORE (option 1): collapse repeated spaces before splitting
    a <- gsub(" +", " ", x)
    strsplit(a, " ")
    ## TREAT AFTER (option 2): split first, then drop the empty strings
    y <- unlist(strsplit(x, " "))
    y[!y %in% ""]

You can also try:

 newtext <- lapply(newtext, function(x) gsub(" +", " ", x)) 
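Putting the two ideas together for the pipeline in the question (a sketch; `newtext` here is a stand-in list of cleaned strings): collapse repeated spaces, trim leading/trailing whitespace, and check that no empty tokens survive a split:

```r
newtext <- list("  body mohamed  hassan ", " cook found turkish")
# collapse runs of spaces, then trim a leading/trailing space
newtext <- lapply(newtext, function(x) gsub("^ | $", "", gsub(" +", " ", x)))
# after this, splitting on a single space yields no "" tokens
tokens <- unlist(strsplit(newtext[[1]], " "))
any(tokens %in% "")   # FALSE
```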

Again, I do not use tm, so this may not help, but this post had not seen any activity, so I figured I would share the possibilities.


If you already have a corpus set up, try using document length as a filter: attach it to meta() as a tag, and then create a new corpus from the non-empty documents.

    dtm <- DocumentTermMatrix(corpus)
    ## terms per document
    doc.length <- rowSums(as.matrix(dtm))
    ## add length as a document meta tag
    meta(corpus, tag = "Length") <- doc.length
    ## create a new corpus containing only non-empty documents
    corpus.noEmptyDocs <- tm_filter(corpus, FUN = sFilter, "Length > 0")
    ## remove Length as a meta tag
    meta(corpus, tag = "Length") <- NULL

With the method described above, you can take advantage of tm's existing matrix manipulation support to filter out the empty documents in only five lines of code.
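If sFilter is not available in your version of tm (the filtering API has changed across releases), the same idea works with plain indexing on the document lengths. A minimal base-R sketch of just the filtering step, with hypothetical objects standing in for the corpus:

```r
# documents as token vectors; the second one is empty after cleaning
docs <- list(c("body", "cook"), character(0), c("rescue", "teams"))
doc.length <- sapply(docs, length)    # terms per document
docs.noEmpty <- docs[doc.length > 0]  # keep only non-empty documents
length(docs.noEmpty)                  # 2
```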

