Update:
I think I may have a workaround for this problem: add one line of code,

    dtms = removeSparseTerms(dtm, 0.1)

which removes the sparse terms in this case. But I think this is ONLY a workaround, so I am still awaiting expert feedback!
Recently I have been learning text processing in R using the tm package, and I had the idea of drawing a word cloud of the most frequent words in my ABAP program. So I wrote an R program to do this:
    library(tm)
    library(SnowballC)
    library(wordcloud)

    # set path
    path = system.file("texts", "abapcode", package = "tm")

    # make corpus
    code = Corpus(DirSource(path), readerControl = list(language = "en"))

    # cleanse text
    code = tm_map(code, stripWhitespace)
    code = tm_map(code, removeWords, stopwords("en"))
    code = tm_map(code, removePunctuation)
    code = tm_map(code, removeNumbers)

    # make DocumentTermMatrix
    dtm = DocumentTermMatrix(code)

    # frequency
    freq = sort(colSums(as.matrix(dtm)), decreasing = T)

    #wordcloud(code,scale = c(5,1),max.words = 50,random.order = F,colors = brewer.pal(8, "Dark2"),rot.per = 0.35,use.r.layout = F)
    wordcloud(names(freq), freq, scale = c(5,1), max.words = 50, random.order = F, colors = brewer.pal(8, "Dark2"), rot.per = 0.35, use.r.layout = F)
But in my ABAP code, some variable names contain "_" and "-", so if I run this:

    code = tm_map(code, removePunctuation)
the contents of the corpus are no longer correct, and the word cloud looks like this:
Some words look very strange once "_" or "-" has been deleted.
If I then comment out that line, the word cloud looks like this:
This time the words are correct, but some unexpected characters appear, for example, my code battalion ABAP ...
Is there a way to remove only the punctuation we don't want while keeping the characters we need?
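To make the question concrete, here is a minimal sketch of the kind of selective removal I have in mind, written in base R only. The helper name `remove_punct_except` and its `keep` argument are my own invention for illustration and are not part of tm; the sketch assumes `keep` holds only characters that are safe inside a regex character class, with "-" placed last.

```r
# Hypothetical helper (not part of tm): delete punctuation, but leave
# the characters listed in `keep` untouched.
remove_punct_except <- function(x, keep = "_-") {
  # The negative lookahead (?![...]) rejects any character in `keep`;
  # every other character matched by [[:punct:]] is removed.
  pattern <- paste0("(?![", keep, "])[[:punct:]]")
  gsub(pattern, "", x, perl = TRUE)
}

# Example on a made-up ABAP-like line
abap_line <- "lv_count = ls_data-field + 1."
cleaned <- remove_punct_except(abap_line)
print(cleaned)
```

If this is a reasonable approach, I assume it could be plugged into the pipeline with `tm_map(code, content_transformer(remove_punct_except))` in place of the `removePunctuation` step. As far as I can tell, `removePunctuation` also accepts `preserve_intra_word_dashes = TRUE`, but that would only cover "-" and not "_".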