Since you are also dealing with accented characters, I can present two options:
- Completely get rid of accented characters.
- Use iconv to try to "transliterate" accented characters to their closest ASCII equivalents.
Both are shown below. In both examples, I use the following sample text:
Z <- c("ANGLO AUTOMOTRIZ SA MATRIZ", "AUTOMOTORES Y ANEXOS / AYASA",
       "ECUA - AUTO SA MATRIZ", "METROCAR SA 10 DE AGOSTO",
       "MOSUMI LA \"Y\"", "distribuir contenidos",
       "proponer autoevaluaciones", "como buzón de actividades")
Option 1: Note that the accented "ó" is discarded in the last element.
gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", Z, perl=TRUE)
# [1] "ANGLOAUTOMOTRIZSAMATRIZ"  "AUTOMOTORESYANEXOSAYASA"  "ECUAAUTOSAMATRIZ"
# [4] "METROCARSA10DEAGOSTO"     "MOSUMILAY"                "distribuircontenidos"
# [7] "proponerautoevaluaciones" "comobuzndeactividades"
Option 2: Note that the "ó" has been converted to "o".
gsub("[[:punct:]]|[[:space:]]", "", iconv(Z, to = "ASCII//TRANSLIT"))
# [1] "ANGLOAUTOMOTRIZSAMATRIZ"  "AUTOMOTORESYANEXOSAYASA"  "ECUAAUTOSAMATRIZ"
# [4] "METROCARSA10DEAGOSTO"     "MOSUMILAY"                "distribuircontenidos"
# [7] "proponerautoevaluaciones" "comobuzondeactividades"
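One caveat with option 2: the result of "ASCII//TRANSLIT" depends on the platform's iconv implementation, so the same call may not give identical output everywhere. If you need predictable results, a rough alternative is to map a fixed set of accented letters yourself with chartr(). The sketch below is not part of the original answer; strip_accents is a hypothetical helper name, and the character list only covers common Spanish letters, so extend it as needed.

```r
# Hypothetical helper (an assumption, not from the original answer):
# map a fixed set of accented letters to ASCII, then drop punctuation
# and whitespace as in option 2.
strip_accents <- function(x) {
  # \u escapes keep the source file itself pure ASCII
  accented <- paste0("\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc\u00f1",   # lower case
                     "\u00c1\u00c9\u00cd\u00d3\u00da\u00dc\u00d1")   # upper case
  plain    <- "aeiouunAEIOUUN"
  x <- chartr(accented, plain, x)          # one-to-one character mapping
  gsub("[[:punct:]]|[[:space:]]", "", x)   # same cleanup as option 2
}

result <- strip_accents(c("como buz\u00f3n de actividades",
                          "ECUA - AUTO SA MATRIZ"))
result
# [1] "comobuzondeactividades" "ECUAAUTOSAMATRIZ"
```

Unlike iconv, this leaves any character that is not in the mapping untouched, so it is only a reasonable choice when you know which accented letters can occur.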
Notes:
- For convenience, I decided to use the character classes [[:punct:]] and [[:space:]].
- For the first option, you need perl = TRUE to recognize the character class [[:ascii:]]. The ^ in option 1 means "not" (so you can read the pattern as "find anything that is not an ASCII character, or that is a space, or that is a punctuation mark, and replace it with nothing").
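To see what each branch of the option 1 pattern matches, you can pull the matches out one alternative at a time. This is only an illustrative sketch; the sample string x and the variable names are made up for the example.

```r
# Made-up sample string: "buzón: sí, no" (written with \u escapes)
x <- "buz\u00f3n: s\u00ed, no"

# Each alternative of the option 1 pattern, inspected separately
non_ascii <- regmatches(x, gregexpr("[^[:ascii:]]", x, perl = TRUE))[[1]]
punct     <- regmatches(x, gregexpr("[[:punct:]]",  x, perl = TRUE))[[1]]
spaces    <- regmatches(x, gregexpr("[[:space:]]",  x, perl = TRUE))[[1]]

non_ascii  # the two accented letters
punct      # ":" and ","
spaces     # the two spaces

# All three alternatives combined, as in option 1
cleaned <- gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", x, perl = TRUE)
cleaned
# [1] "buznsno"
```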
A5C1D2H2I1M1N2O1R2T1