Punctuation removal, excluding apostrophes and intra-word dashes in R

I know how to separately remove punctuation and keep apostrophes:

gsub("[^[:alnum:]']", " ", db$text)

or how to preserve intra-word dashes using the tm package:

removePunctuation(db$text, preserve_intra_word_dashes = TRUE)

but I can't find a way to do both at the same time. For example, if my initial sentence is:

"Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"

I would like to get:

"Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"

Of course, there will be extra spaces, but I can remove them later.

I would be grateful for your help.

2 answers

Use character classes

gsub("[^[:alnum:]'-]", " ", db$text)

## "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
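Note that the call above replaces each removed character with a space, so the raw result contains runs of spaces. A minimal sketch of the full clean-up on the sentence from the question, assuming base R only (trimws() needs R 3.2.0 or later; the variable names cleaned and clean_text are just illustrative):

```r
# Sample sentence from the question.
text <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"

# Replace everything except letters, digits, apostrophes and dashes with a space.
# The dash sits last in the bracket expression so it is read literally.
cleaned <- gsub("[^[:alnum:]'-]", " ", text)

# Collapse runs of whitespace into one space and trim both ends.
clean_text <- trimws(gsub("[[:space:]]+", " ", cleaned))
clean_text
```

After the whitespace step, clean_text should match the desired sentence from the question.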

I like David Arenberg's answer. If you need another way, you can try:

library(qdap)

text <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"

gsub("/", " ", strip(text, char.keep = c("-", "/"), apostrophe.remove = FALSE, lower.case = FALSE))
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"

or

library(gsubfn)
clean(gsubfn("[[:punct:]]", function(x) ifelse(x == "'", "'", ifelse(x == "-", "-", " ")), text))
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"

clean() comes from the qdap package.
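One caveat: a pattern that simply keeps "-" keeps every dash, including standalone ones such as " - " or " -- ". If only dashes between two letters or digits should survive, a second pass with lookarounds is one option. This is a sketch in base R, not something taken from the answers above; the input string x and the names step1, step2 and result are hypothetical:

```r
# Hypothetical input with intra-word dashes and standalone dashes.
x <- "our new e-board - great news -- well-done"

# First pass: keep only letters, digits, apostrophes and dashes.
step1 <- gsub("[^[:alnum:]'-]", " ", x)

# Second pass: blank out any dash that lacks a letter or digit on either side.
# Lookbehind and lookahead require perl = TRUE.
step2 <- gsub("(?<![[:alnum:]])-|-(?![[:alnum:]])", " ", step1, perl = TRUE)

# Collapse the leftover whitespace.
result <- trimws(gsub("[[:space:]]+", " ", step2))
result
```

Here "e-board" and "well-done" keep their dashes, while the free-standing dashes are dropped.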
