How to remove special characters and spaces from a character variable in R in a single line

I have a small problem in R with a variable of type character. The variable in my data frame looks like this:

 X1
 ANGLO AUTOMOTRIZ SA MATRIZ
 AUTOMOTORES Y ANEXOS / AYASA
 ECUA - AUTO SA MATRIZ
 METROCAR SA 10 DE AGOSTO
 MOSUMI LA "Y"

I want a new variable without the characters . / - and ", and with each row collapsed into a single string without spaces:

 X2
 ANGLOAUTOMOTRIZSAMATRIZ
 AUTOMOTORESYANEXOSAYASA
 ECUAAUTOSAMATRIZ
 METROCARSA10DEAGOSTO
 MOSUMILAY

Can this be done in R? Thanks.

2 answers

Try gsub ...

 gsub( "\\.|/|\\-|\"|\\s" , "" , df$X1 )
 #[1] "ANGLOAUTOMOTRIZSAMATRIZ" "AUTOMOTORESYANEXOSAYASA" "ECUAAUTOSAMATRIZ"
 #[4] "METROCARSA10DEAGOSTO"    "MOSUMILAY"

  • \\. - matches a literal .
  • | - the alternation (OR) operator
  • / - matches a literal / (no escaping required)
  • \\- - matches a literal -
  • \" - matches a literal "
  • \\s - matches any whitespace character

gsub replaces every match in each string, not just the first one, and it is vectorized, so you can pass the entire column at once. The second argument is the replacement value, which in this case is "" , so every matched character is replaced with nothing.
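As a minimal sketch of the whole-column call, assuming a data frame df rebuilt by hand from the rows shown in the question (the construction below is an assumption, not the asker's actual data), the result can be stored directly as the new variable X2. The bracket expression at the end is only an alternative spelling of the same character set, not something the answer requires:

```r
# Rebuild the question's column by hand (assumed data, for illustration only)
df <- data.frame(
  X1 = c("ANGLO AUTOMOTRIZ SA MATRIZ",
         "AUTOMOTORES Y ANEXOS / AYASA",
         "ECUA - AUTO SA MATRIZ",
         "METROCAR SA 10 DE AGOSTO",
         "MOSUMI LA \"Y\""),
  stringsAsFactors = FALSE
)

# gsub() is vectorized: one call cleans every row of the column
df$X2 <- gsub("\\.|/|\\-|\"|\\s", "", df$X1)

# Same character set written as a single bracket expression;
# the hyphen goes last so it is read literally, not as a range
alt <- gsub("[./\"[:space:]-]", "", df$X1)

# Both spellings should give the same cleaned column
stopifnot(identical(df$X2, alt))
print(df$X2)
```

The bracket form tends to be easier to extend: adding another character to remove means adding it inside the brackets rather than adding another | branch.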


Since you are also dealing with accented characters, I can present two options:

  • Get rid of the accented characters completely.
  • Use iconv to try to "transliterate" the accented characters to their closest ASCII equivalents.

Here are both. In both examples, I use the following sample text:

 Z <- c("ANGLO AUTOMOTRIZ SA MATRIZ", "AUTOMOTORES Y ANEXOS / AYASA",
        "ECUA - AUTO SA MATRIZ", "METROCAR SA 10 DE AGOSTO",
        "MOSUMI LA \"Y\"", "distribuir contenidos",
        "proponer autoevaluaciones", "como buzón de actividades")

Option 1: Note that the accented "ó" is dropped in the last element.

 gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", Z, perl=TRUE)
 # [1] "ANGLOAUTOMOTRIZSAMATRIZ"  "AUTOMOTORESYANEXOSAYASA"  "ECUAAUTOSAMATRIZ"
 # [4] "METROCARSA10DEAGOSTO"     "MOSUMILAY"                "distribuircontenidos"
 # [7] "proponerautoevaluaciones" "comobuzndeactividades"

Option 2: Note that the "ó" has been converted to "o".

 gsub("[[:punct:]]|[[:space:]]", "", iconv(Z, to = "ASCII//TRANSLIT"))
 # [1] "ANGLOAUTOMOTRIZSAMATRIZ"  "AUTOMOTORESYANEXOSAYASA"  "ECUAAUTOSAMATRIZ"
 # [4] "METROCARSA10DEAGOSTO"     "MOSUMILAY"                "distribuircontenidos"
 # [7] "proponerautoevaluaciones" "comobuzondeactividades"

Notes:

  • For convenience, I decided to use the character classes [[:punct:]] and [[:space:]] .
  • For the first option, you need perl = TRUE to recognize the character class [[:ascii:]] .
  • The ^ in option 1 means "not" (so you can read the pattern as "find anything that is not an ASCII character, or that is a punctuation mark, or that is whitespace, and replace it with nothing").
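
The two options can be folded into one function. The sketch below is only an illustration: clean_text and its translit argument are hypothetical names, not part of base R, and declaring the input as UTF-8 with from = "UTF-8" is an assumption about your data. What ASCII//TRANSLIT produces for an accented letter depends on the platform's iconv implementation, and iconv returns NA for elements it cannot convert, so check the transliterated result on your own system.

```r
# clean_text() is a hypothetical helper, not a base R function
clean_text <- function(x, translit = FALSE) {
  if (translit) {
    # Option 2: transliterate accented characters first
    # (assumes UTF-8 input; the result is platform dependent and may be NA)
    x <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
  }
  # Option 1 pattern: drop anything that is not ASCII, plus punctuation
  # and whitespace. After transliteration no non-ASCII characters should
  # remain, so this then acts like the option 2 pattern.
  gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", x, perl = TRUE)
}

# "\u00f3" is the accented o, written as an escape to avoid encoding issues
Z <- c("MOSUMI LA \"Y\"", "como buz\u00f3n de actividades")

dropped <- clean_text(Z)                          # accented letter removed
transliterated <- clean_text(Z, translit = TRUE)  # accented letter converted
print(dropped)
print(transliterated)
```

With translit = FALSE the accented letter simply disappears, as in option 1. With translit = TRUE the last element depends on what iconv returns for "ó" on your machine.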