Replace specific characters in a variable in a data frame in R

I want to replace every comma, hyphen, closing parenthesis, opening parenthesis and space with a period in the variable DMA.NAME of the sample data frame below. I referred to three posts and tried their approaches, but all of them failed:

Replacing column values in a non-list data frame

R replace all specific values in the data frame

Replace characters from column of data frame R

Approach 1

 > shouldbecomeperiod <- c$DMA.NAME %in% c("-", ",", " ", "(", ")")
 > c$DMA.NAME[shouldbecomeperiod] <- "."

Approach 2

 > removetext <- c("-", ",", " ", "(", ")")
 > c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME)
 > c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME, fixed = TRUE)
 Warning message:
 In gsub(removetext, ".", c$DMA.NAME) :
   argument 'pattern' has length > 1 and only the first element will be used

Approach 3

 > c[c == c(" ", ",", "(", ")", "-")] <- "." 

Data frame example

 > df
       DMA.CODE                  DATE                   DMA.NAME count
 111         22 8/14/2014 12:00:00 AM               Columbus, OH     1
 112         23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
 79          18 7/30/2014 12:00:00 AM        Boston (Manchester)     1
 99          22 8/20/2014 12:00:00 AM               Columbus, OH     1
 112.1       23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
 208         27 7/31/2014 12:00:00 AM       Minneapolis-St. Paul     1

I know what the problem is: gsub takes a single pattern, so it uses only the first element of the vector. The other two approaches compare each whole value against the listed characters instead of searching within the value for those characters.
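To make that diagnosis concrete, here is a minimal sketch using two values copied from the sample data. The last step, collapsing the vector into one bracket expression, is only an illustration and not something any of the linked posts suggest. It assumes the hyphen stays first in the vector so that it is read literally, and that the vector contains no "]", "^" or backslash.

```r
# Two values copied from the question's DMA.NAME column
x <- c("Columbus, OH", "Boston (Manchester)")
removetext <- c("-", ",", " ", "(", ")")

# %in% (like ==) compares each whole string with the listed values,
# so nothing matches: no DMA.NAME is exactly "-" or ","
whole_match <- x %in% removetext
print(whole_match)   # FALSE FALSE

# Collapse the vector into a single bracket expression: "[-, ()]"
# Assumption: "-" comes first, and there is no "]", "^" or "\\" in the vector
pattern <- paste0("[", paste(removetext, collapse = ""), "]")
replaced <- gsub(pattern, ".", x)
print(replaced)      # "Columbus..OH" "Boston..Manchester."
```

Each matched character becomes its own period here, which is why "Columbus, OH" ends up with two in a row. Adding + after the closing bracket would merge runs into one.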

2 answers

You can use the POSIX character classes [:punct:] and [:space:] inside a bracket expression ( [...] ) as follows:

 df <- data.frame(
   DMA.NAME = c(
     "Columbus, OH",
     "Orlando-Daytona Bch-Melbrn",
     "Boston (Manchester)",
     "Columbus, OH",
     "Orlando-Daytona Bch-Melbrn",
     "Minneapolis-St. Paul"),
   stringsAsFactors = FALSE)
 ##
 > gsub("[[:punct:][:space:]]+", "\\.", df$DMA.NAME)
 [1] "Columbus.OH"                "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."         "Columbus.OH"
 [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
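Keep in mind that [:punct:] covers more than the five characters listed in the question. The sketch below compares it with an explicit set on a made-up value (the string is hypothetical and does not come from the sample data), in case you need to leave other punctuation alone.

```r
# Hypothetical value, not taken from the question's data
x <- "Raleigh-Durham & Fayetteville"

# The broad class also swallows the ampersand
broad <- gsub("[[:punct:][:space:]]+", ".", x)
print(broad)    # "Raleigh.Durham.Fayetteville"

# An explicit set limited to - , space ( ) leaves "&" in place
narrow <- gsub("[-, ()]+", ".", x)
print(narrow)   # "Raleigh.Durham.&.Fayetteville"
```

To change the column itself, assign the result back, for example df$DMA.NAME <- gsub("[-, ()]+", ".", df$DMA.NAME).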

If your data frame is large, you might want to try this fast function from the stringi package. It replaces every character belonging to a given class with another string. Here the character class is L, letters (inside the {} ), but the capital P (before the {} ) selects the complement of that set, so the pattern matches every character that is not a letter. The merge argument indicates that consecutive matches should be combined into one.

 require(stringi)
 stri_replace_all_charclass(df$DMA.NAME, "\\P{L}", ".", merge = TRUE)
 ## [1] "Columbus.OH"                "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."         "Columbus.OH"
 ## [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
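One caveat with \P{L}: digits are not letters, so they get replaced too. That makes no difference for the sample names, but the sketch below shows the effect on a made-up value (hypothetical, not from the question), along with a set that keeps digits. It assumes the pattern argument accepts a negated set written in ICU set notation.

```r
library(stringi)

# Hypothetical value, not taken from the question's data
x <- "Area 51, NV"

# Everything that is not a letter, digits included, collapses into one period
letters_only <- stri_replace_all_charclass(x, "\\P{L}", ".", merge = TRUE)
print(letters_only)     # "Area.NV"

# Negated set: anything that is neither a letter (L) nor a number (N)
letters_digits <- stri_replace_all_charclass(x, "[^\\p{L}\\p{N}]", ".", merge = TRUE)
print(letters_digits)   # "Area.51.NV"
```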

And some benchmarks:

 x <- sample(df$DMA.NAME, 1000, TRUE)
 gsubFun <- function(x) {
   gsub("[[:punct:][:space:]]+", "\\.", x)
 }
 striFun <- function(x) {
   stri_replace_all_charclass(x, "\\P{L}", ".", TRUE)
 }
 require(microbenchmark)
 microbenchmark(gsubFun(x), striFun(x))
 Unit: microseconds
        expr      min        lq   median        uq       max neval
  gsubFun(x) 3472.276 3511.0015 3538.097 3573.5835 11039.984   100
  striFun(x)  877.259  893.3945  907.769  929.8065  3189.017   100