Convert the names of long states nested with other text into two-letter state abbreviations

Question

My goal is to identify US states written out in a character vector that also contains other text, and to convert those states to their abbreviated form. For example, "North Carolina" becomes "NC". This is simple if the vector contains only long-form state names. However, my vector has other text in random places, as in the "states" example.

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")

From another post, I found this:

state.abb[match(states, state.name)]

but it only converts the standalone "Texas"

> state.abb[match(states, state.name)]
[1] NA   NA   NA   NA   "TX" NA

not the entries containing New Jersey, Alabama, and Iowa.

From the post "Quick grep with a vector pattern or match to return a list of all matches", I tried:

sapply(states, grep(pattern = state.name, x = states, value = TRUE))

but that gives:

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object 'Alabama 02138' of mode 'function' was not found
In addition: Warning message:
In grep(pattern = state.name, x = states, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used

And this does not work:

sapply(states, function(x) state.abb[grep(state.name, states)])

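How can I convert state names that are embedded in other text?

EDIT: Ideally, the state name is replaced in place and the surrounding text is kept, e.g. "Plano New Jersey" becomes "Plano NJ".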

+3

regex grep r

lawyeR Aug 30 '14 at 12:45

5

Try:

indx <- paste0(".*(", paste(state.name, collapse="|"), ").*")
v1 <- gsub(indx, "\\1", states)
ifelse( v1 %in% state.abb, v1, state.abb[match(v1, state.name)])
#[1] "NJ" "NC" NA   "AL" "TX" "IA"

Or, to replace the state names in place and keep the surrounding text:

indx1 <- paste(state.name, collapse="|")   
indx2 <- state.abb[match(v1, state.name)]

mapply(gsub, indx1, indx2, states, USE.NAMES=F)
#[1] "Plano NJ"      "NC"            "xyz"           "AL 02138"     
#[5] "TX"            "Town IA 99999"

+3

akrun Aug 30 '14 at 13:24

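Create a list, st, whose names are the full state names and whose values are the abbreviations. Then build a regular expression that matches any state name with paste(..., collapse = "|"), and use gsubfn from the gsubfn package to replace each match with the corresponding element of st.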

library(gsubfn)
st <- as.list(setNames(state.abb, state.name))
gsubfn(paste(state.name, collapse = "|"), st, states)

giving:

[1] "Plano NJ"      "NC"            "xyz"           "AL 02138"     
[5] "TX"            "Town IA 99999"
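
For readers who prefer not to install gsubfn, the same single-pass replacement can probably be sketched in base R with gregexpr and regmatches<-. This is only an illustration, not part of the original answer; the variable names pat and m are hypothetical, and the matching is case-sensitive, as it is in the gsubfn call above.

```r
states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")

# One alternation pattern that matches any full state name
pat <- paste(state.name, collapse = "|")

# Locate every match in every element (elements without a match yield character(0))
m <- gregexpr(pat, states)

# Swap each matched name for its abbreviation, leaving the other text untouched
regmatches(states, m) <- lapply(regmatches(states, m), function(s) {
  state.abb[match(s, state.name)]
})

states
# [1] "Plano NJ"      "NC"            "xyz"           "AL 02138"
# [5] "TX"            "Town IA 99999"
```

Elements with no state name pass through unchanged because their list of matches is empty.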

+1

G. Grothendieck Aug 30 '14 at 13:24

You can use mapply to call gsub once for each pair of state.name and state.abb, for example:

mapply(gsub,state.name,state.abb,"ALABAMA 123",ignore.case=TRUE,USE.NAMES=FALSE)

This returns one result per state, and only the result for the matching state is changed:

 [1] "AL 123"      "ALABAMA 123" "ALABAMA 123" "ALABAMA 123" "ALABAMA 123" 
 [6] ...

Taking the shortest string from this vector gives the desired result, so we sort the results by length and take the first element.

Full code:

replaceState <- function(x) {  
     v = mapply(gsub,state.name,state.abb,x,ignore.case=TRUE, USE.NAMES=FALSE)
     v[order(nchar(v))][1] 
}

sapply(states, replaceState, USE.NAMES=FALSE)

Unfortunately, this approach replaces only one state name per string (the longest). To replace several different states, we need to iterate, for example:

replaceState <- function(x) {  
     v = mapply(gsub,state.name,state.abb,x,ignore.case=TRUE, USE.NAMES=FALSE)
     v[order(nchar(v))][1] 
}

replaceStates <- function(x) {
     newX = replaceState(x)

     # if they are different a state has been replaced, 
     # we try again to replace all states.
     if(newX != x){ 
          replaceStates(newX)
     } else {
          newX
     }
}

# Note the 'replaceStates'
sapply(states, replaceStates, USE.NAMES=FALSE)
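
As a hedged alternative to the recursion above, the same idea can be sketched as a plain loop over the built-in state.name and state.abb vectors. The helper name abbreviate_states is hypothetical, and the word-boundary anchors and longest-name-first ordering are assumptions added here to avoid partial matches such as "Kansas" inside "Arkansas" or "Virginia" inside "West Virginia".

```r
# Hypothetical helper (not from the original answer): replace every full
# state name in each string with its two-letter abbreviation, base R only.
abbreviate_states <- function(x) {
  # Process longer names first so "West Virginia" is handled before "Virginia"
  ord <- order(nchar(state.name), decreasing = TRUE)
  for (i in ord) {
    # \\b anchors the match at word boundaries
    pattern <- paste0("\\b", state.name[i], "\\b")
    x <- gsub(pattern, state.abb[i], x, ignore.case = TRUE, perl = TRUE)
  }
  x
}

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")
result <- abbreviate_states(states)
result
# [1] "Plano NJ"      "NC"            "xyz"           "AL 02138"
# [5] "TX"            "Town IA 99999"
```

Because gsub is vectorised over x, the function takes the whole character vector at once and no sapply call is needed.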

+1

ebo Aug 30 '14 at 13:45

Try:

for(r in seq_len(nrow(states.list))) {
    # Replace each full state name (column 1) with its abbreviation (column 2)
    states = gsub(states.list[r,1], states.list[r,2], states)
}

states
[1] "Plano NJ"      "NC"            "xyz"           "AL 02138"      "TX"            "Town IA 99999"

Data:

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")

states.list = structure(list(state.name = structure(c(4L, 1L, 5L, 2L, 3L), .Label = c("Alabama", 
"Iowa", "Minnesota", "New Jersey", "Texas"), class = "factor"), 
    state.abb = structure(c(4L, 1L, 5L, 2L, 3L), .Label = c("AL", 
    "IA", "MN", "NJ", "TX"), class = "factor")), .Names = c("state.name", 
"state.abb"), class = "data.frame", row.names = c(NA, -5L))

states.list
  state.name state.abb
1 New Jersey        NJ
2    Alabama        AL
3      Texas        TX
4       Iowa        IA
5  Minnesota        MN

0

rnso Aug 30 '14 at 15:24

Tyler Rinker · Accepted Answer · 2014-08-30T13:53:15+0000

Using the qdap package:

library(qdap)
mgsub(state.name, state.abb, states)

## [1] "Plano NJ"      "NC"            "xyz"           "AL 02138"
## [5] "TX"            "Town IA 99999"

Or, to ignore case when matching:

mgsub(state.name, state.abb, states, ignore.case=TRUE, fixed=FALSE)
