How to accurately remove punctuation when using R with the tm package

Update:

I think I have a workaround for this problem: add one line, dtms = removeSparseTerms(dtm, 0.1). It removes the sparse terms, which in this case includes the stray characters. I consider this ONLY a workaround, though, and am still waiting for expert feedback!


I have recently been learning text processing in R with the tm package, and I had the idea of drawing a word cloud of the most frequent words in my ABAP program. So I wrote an R program to do this:

    library(tm)
    library(SnowballC)
    library(wordcloud)

    # set path
    path = system.file("texts", "abapcode", package = "tm")

    # make corpus
    code = Corpus(DirSource(path), readerControl = list(language = "en"))

    # cleanse text
    code = tm_map(code, stripWhitespace)
    code = tm_map(code, removeWords, stopwords("en"))
    code = tm_map(code, removePunctuation)
    code = tm_map(code, removeNumbers)

    # make DocumentTermMatrix
    dtm = DocumentTermMatrix(code)

    # frequency
    freq = sort(colSums(as.matrix(dtm)), decreasing = T)

    #wordcloud(code,scale = c(5,1),max.words = 50,random.order = F,colors = brewer.pal(8, "Dark2"),rot.per = 0.35,use.r.layout = F)
    wordcloud(names(freq), freq, scale = c(5,1), max.words = 50, random.order = F,
              colors = brewer.pal(8, "Dark2"), rot.per = 0.35, use.r.layout = F)

But in my ABAP code, some variable names contain "_" and "-", so if I do this:

    code = tm_map(code, removePunctuation)

The corpus contents are then no longer correct, and the word cloud reflects that.

Some words look very strange once "_" or "-" has been deleted from them.

If I comment out that line and draw the word cloud again, the words are correct this time, but some unexpected characters appear in the cloud.

Is there a way to remove only the punctuation we don’t want and keep the characters we need?

+6
2 answers

Posting this as an answer so that the code can be formatted. It is adapted from the content_transformer documentation, which is linked from getTransformations, which in turn is linked from the tm_map documentation.

It basically uses gsub inside content_transformer to do the same thing as removePunctuation (which removes the regex class [:punct:]), minus _ and -. removePunctuation itself can preserve dashes (-) but not underscores (_).

    f <- content_transformer(function(x, pattern) gsub(pattern, "", x))
    # every ASCII punctuation character except "_" and "-"
    code <- tm_map(code, f, "[!\"#$%&'*+,./)(:;<=>?@\\][\\\\^`{|}~]")

In the character class, you need to escape the backslash \ , the double quote " and the closing bracket ] .
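To check this idea without loading tm, here is a minimal base R sketch. It is not the code from the answer above: instead of listing every character explicitly, it uses a Perl-style negative lookahead in front of [[:punct:]], which avoids most of the escaping. The helper name strip_punct_keep and the sample ABAP-like lines are made up for illustration.

```r
# Hypothetical helper: remove every punctuation character except "_" and "-".
# The lookahead (?![_-]) rejects the two characters we want to keep before
# [[:punct:]] gets a chance to match them; lookaheads require perl = TRUE.
strip_punct_keep <- function(x) {
  gsub("(?![_-])[[:punct:]]", "", x, perl = TRUE)
}

# Made-up ABAP-like lines, for illustration only.
abap_lines <- c("DATA: lv_total TYPE i.", "CLEAR wa-field.")
cleaned <- strip_punct_keep(abap_lines)
print(cleaned)

# With tm loaded, the same function could be applied to the corpus:
# code <- tm_map(code, content_transformer(strip_punct_keep))
```

The names lv_total and wa-field come through intact, while the colon and the full stops are dropped. Keeping the helper as a plain function also makes it easy to try on a few strings before running it over the whole corpus.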

+4

Well ... the following works: convert the corpus to a data frame, delete the unwanted characters, and then convert the result back to a corpus.

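    # pull the text of each document out of the corpus
    dataframe <- data.frame(text = unlist(sapply(code, `[`, "content")), stringsAsFactors = F)
    # note: this class also contains "_"; drop it from the pattern to keep underscores
    dataframe$text <- gsub("[][!#$%()*,.:;<=>@^_|~.{}]", "", dataframe$text)
    # rebuild the corpus from the cleaned text
    code <- Corpus(VectorSource(dataframe$text))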

0
