Using R, I'm trying to match the names of people in a dataset structured by year and city. Because of spelling errors, exact matching is not possible, so I'm trying to use agrep() to fuzzy-match the names.
A sample fragment of the data set is structured as follows:
df <- data.frame(matrix(
  c("1200013", "1200013", "1200013", "1200013",
    "1200013", "1200013", "1200013", "1200013",
    "1996", "1996", "1996", "1996", "2000", "2000", "2004", "2004",
    "AGUSTINHO FORTUNATO FILHO", "ANTONIO PEREIRA NETO",
    "FERNANDO JOSE DA COSTA", "PAULO CEZAR FERREIRA DE ARAUJO",
    "PAULO CESAR FERREIRA DE ARAUJO", "SEBASTIAO BOCALOM RODRIGUES",
    "JOAO DE ALMEIDA", "PAULO CESAR FERREIRA DE ARAUJO"),
  ncol = 3,
  dimnames = list(1:8, c("citycode", "year", "candidate"))
))
Printed version:
  citycode year candidate
1  1200013 1996 AGUSTINHO FORTUNATO FILHO
2  1200013 1996 ANTONIO PEREIRA NETO
3  1200013 1996 FERNANDO JOSE DA COSTA
4  1200013 1996 PAULO CEZAR FERREIRA DE ARAUJO
5  1200013 2000 PAULO CESAR FERREIRA DE ARAUJO
6  1200013 2000 SEBASTIAO BOCALOM RODRIGUES
7  1200013 2004 JOAO DE ALMEIDA
8  1200013 2004 PAULO CESAR FERREIRA DE ARAUJO
For each city separately, I would like to check whether any candidate appears in more than one year. For instance, in the example above,
PAULO CEZAR FERREIRA DE ARAUJO
PAULO CESAR FERREIRA DE ARAUJO
is one candidate who appears more than once, with a spelling error. Each candidate in the data set should be assigned a unique numeric identifier. The data set is quite large (5500 cities, about 100 thousand records), so reasonably efficient code would be useful. Any suggestions on how to implement this?
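To make the matching idea concrete, here is a minimal sketch that runs agrep() on the two spellings from the example. The vector name `cands` is introduced here purely for illustration, and the call relies on agrep()'s default `max.distance` of 0.1, which for a name of this length should tolerate a one-letter substitution.

```r
# Illustrative vector (the name `cands` is hypothetical), taken from the sample data
cands <- c("PAULO CEZAR FERREIRA DE ARAUJO",
           "PAULO CESAR FERREIRA DE ARAUJO",
           "JOAO DE ALMEIDA")

# Indices of the elements that approximately match the first spelling
hits <- agrep(cands[1], cands, fixed = TRUE)

# Edit distance between the two spellings (a single substitution, Z -> S)
d <- adist(cands[1], cands[2])
```

Note that agrep() performs approximate substring matching, so short names can match inside longer ones; checking the distance with adist() is one way to inspect how close two candidates really are.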
EDIT: Here is my attempt (based on the comments so far). It does the job, but it is very slow. Any suggestions for improving it?
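f <- function(x) {
  # x is a factor; for each level, find the indices of approximately matching levels
  matches <- lapply(levels(x), agrep, x = levels(x), fixed = TRUE, value = FALSE)
  # Relabel each level with its first fuzzy match, which merges spelling variants
  levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
  x
}
temp <- tapply(df$candidate, df$citycode, f, simplify = TRUE)
# unlist() returns the values grouped by citycode, so this assumes df is sorted by citycode
df$candidatenew <- unlist(temp)
# Flag rows whose name was changed by the matching step
df$spellerror <- ifelse(as.character(df$candidate) == as.character(df$candidatenew), 0, 1)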
EDIT 2: It now runs at a good speed. The problem was that every step compared against all factor levels (thanks for pointing this out, Blue Master). Restricting the comparison to the candidates within one group (i.e., one city) brings the run time down to 5 seconds for 80,000 rows, which is a speed I can live with.
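# Work with character vectors so each city only carries its own names as levels
df$candidate <- as.character(df$candidate)
f <- function(x) {
  x <- as.factor(x)
  # For each level, find the indices of approximately matching levels within this city
  matches <- lapply(levels(x), agrep, x = levels(x), fixed = TRUE, value = FALSE)
  # Relabel each level with its first fuzzy match, which merges spelling variants
  levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
  as.character(x)
}
temp <- tapply(df$candidate, df$citycode, f, simplify = TRUE)
# unlist() returns the values grouped by citycode, so this assumes df is sorted by citycode
df$candidatenew <- unlist(temp)
# Flag rows whose name was changed by the matching step
df$spellerror <- ifelse(as.character(df$candidate) == as.character(df$candidatenew), 0, 1)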
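For the remaining goal, a unique numeric identifier per candidate, one possible sketch is shown below. It assumes that candidatenew already holds the cleaned names and that identical names in different cities should count as different people (since matching is done per city). The toy data, the second city code, and the column name `candidate_id` are made up for illustration.

```r
# Toy data standing in for the cleaned result (values are illustrative only)
df <- data.frame(
  citycode = c("1200013", "1200013", "1200013", "1200054"),
  year = c("1996", "2000", "2000", "1996"),
  candidatenew = c("PAULO CESAR FERREIRA DE ARAUJO",
                   "PAULO CESAR FERREIRA DE ARAUJO",
                   "SEBASTIAO BOCALOM RODRIGUES",
                   "PAULO CESAR FERREIRA DE ARAUJO"),
  stringsAsFactors = FALSE
)

# Combine city and cleaned name into one key, so the same name in another city stays distinct
key <- paste(df$citycode, df$candidatenew, sep = "|")

# Number the keys in order of first appearance (hypothetical column name)
df$candidate_id <- match(key, unique(key))
```

Because match() against unique() is vectorized, this step should add little to the run time even for around 100 thousand rows.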