How to remove special characters, spaces and trim in a single line a character variable in R

Question

I have a small problem in R with a character variable. The variable in my data frame looks like this:

 X1
 ANGLO AUTOMOTRIZ SA MATRIZ
 AUTOMOTORES Y ANEXOS / AYASA
 ECUA - AUTO SA MATRIZ
 METROCAR SA 10 DE AGOSTO
 MOSUMI LA "Y"

I want a new variable with the characters . / - and " removed, and with all spaces removed so that each row becomes a single unbroken string:

 X2
 ANGLOAUTOMOTRIZSAMATRIZ
 AUTOMOTORESYANEXOSAYASA
 ECUAAUTOSAMATRIZ
 METROCARSA10DEAGOSTO
 MOSUMILAY

Can this be done in R? Thanks.

+7

regex r

Duck Sep 06 '13 at 14:41

2 answers

Since you are also dealing with accented characters, I can present two options:

Completely get rid of accented characters.
Use iconv to try to "transliterate" the accented characters to their closest ASCII equivalents.

Both options are shown here. Both examples use the following sample text:

 Z <- c("ANGLO AUTOMOTRIZ SA MATRIZ", "AUTOMOTORES Y ANEXOS / AYASA",
        "ECUA - AUTO SA MATRIZ", "METROCAR SA 10 DE AGOSTO",
        "MOSUMI LA \"Y\"", "distribuir contenidos",
        "proponer autoevaluaciones", "como buzón de actividades")

Option 1: Note that the accented "ó" is dropped from the last element.

 gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", Z, perl=TRUE)
 # [1] "ANGLOAUTOMOTRIZSAMATRIZ"  "AUTOMOTORESYANEXOSAYASA"  "ECUAAUTOSAMATRIZ"
 # [4] "METROCARSA10DEAGOSTO"     "MOSUMILAY"                "distribuircontenidos"
 # [7] "proponerautoevaluaciones" "comobuzndeactividades"

Option 2: Note that the "ó" has been converted to "o".

 gsub("[[:punct:]]|[[:space:]]", "", iconv(Z, to = "ASCII//TRANSLIT"))
 # [1] "ANGLOAUTOMOTRIZSAMATRIZ"  "AUTOMOTORESYANEXOSAYASA"  "ECUAAUTOSAMATRIZ"
 # [4] "METROCARSA10DEAGOSTO"     "MOSUMILAY"                "distribuircontenidos"
 # [7] "proponerautoevaluaciones" "comobuzondeactividades"

Notes:

For convenience, I decided to use the character classes [[:punct:]] and [[:space:]].
For the first option, you need perl = TRUE for the character class [[:ascii:]] to be recognized.
The ^ in option 1 means "not", so you can read the pattern as "find anything that is not an ASCII character, or is a punctuation mark, or is whitespace, and replace it with nothing".
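
To make that reading concrete, here is a minimal sketch that applies the three alternatives of Option 1 one at a time and compares the result with the combined pattern. The string x is a hypothetical example, not part of Z, and the accented character is written as a Unicode escape so the code does not depend on the file encoding.

```r
# Hypothetical input; "\u00f3" is the accented "ó".
x <- "como buz\u00f3n de \"actividades\" - 10"

# Each alternative from Option 1 applied on its own:
no_non_ascii <- gsub("[^[:ascii:]]", "", x, perl = TRUE)  # drops the accented character
no_punct     <- gsub("[[:punct:]]", "", no_non_ascii)     # drops the quotes and the hyphen
no_space     <- gsub("[[:space:]]", "", no_punct)         # drops all whitespace

# The combined pattern removes the same characters in a single pass.
combined <- gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", x, perl = TRUE)

print(no_space)
print(identical(no_space, combined))
```

Only the first step needs perl = TRUE; the other two classes are also understood by R's default regex engine.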

+6

A5C1D2H2I1M1N2O1R2T1 Sep 06 '13 at 16:53

Simon O'Hanlon · Accepted Answer · 2013-09-06T14:48:06+0000

Try gsub ...

 gsub( "\\.|/|\\-|\"|\\s" , "" , df$X1 )
 #[1] "ANGLOAUTOMOTRIZSAMATRIZ" "AUTOMOTORESYANEXOSAYASA" "ECUAAUTOSAMATRIZ"
 #[4] "METROCARSA10DEAGOSTO"    "MOSUMILAY"

\\. - matches a literal .
| - alternation (OR) between the patterns
/ - matches a / (no escaping required)
\\- - matches a literal -
\" - matches a literal "
\\s - matches whitespace

gsub replaces every match in each string, not just the first one, and it is vectorized, so you can pass the entire column at once. The second argument is the replacement value, which in this case is "", so every matched character is replaced with nothing.
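
As a rough sketch of how this could create the new column in one line, the example below builds a hypothetical data frame df that mirrors the X1 values from the question and writes the cleaned strings to X2. It uses a single bracket expression instead of alternation; that is a variation on the pattern above, not the answer's original code.

```r
# Hypothetical data frame mirroring the X1 column from the question.
df <- data.frame(
  X1 = c("ANGLO AUTOMOTRIZ SA MATRIZ", "AUTOMOTORES Y ANEXOS / AYASA",
         "ECUA - AUTO SA MATRIZ", "METROCAR SA 10 DE AGOSTO",
         "MOSUMI LA \"Y\""),
  stringsAsFactors = FALSE
)

# One bracket expression: ".", "/", a double quote, any whitespace,
# and "-" (placed last so it is taken literally, not as a range).
df$X2 <- gsub("[./\"[:space:]-]", "", df$X1)

print(df$X2)
```

Because every whitespace character is removed, no separate trimming step (for example with trimws) is needed afterwards.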
