Create a unique identifier by fuzzy name matching (via agrep using R)

Using R, I'm trying to match the names of people in a dataset structured by year and city. Because of spelling errors, exact matching is not possible, so I'm trying to use agrep() to fuzzy-match the names.

A sample fragment of the data set is structured as follows:

    df <- data.frame(matrix(
      c("1200013","1200013","1200013","1200013","1200013","1200013","1200013","1200013",
        "1996","1996","1996","1996","2000","2000","2004","2004",
        "AGUSTINHO FORTUNATO FILHO","ANTONIO PEREIRA NETO","FERNANDO JOSE DA COSTA",
        "PAULO CEZAR FERREIRA DE ARAUJO","PAULO CESAR FERREIRA DE ARAUJO",
        "SEBASTIAO BOCALOM RODRIGUES","JOAO DE ALMEIDA","PAULO CESAR FERREIRA DE ARAUJO"),
      ncol=3, dimnames=list(seq(1:8), c("citycode","year","candidate"))
    ))

Printed version:

      citycode year candidate
    1 1200013  1996 AGUSTINHO FORTUNATO FILHO
    2 1200013  1996 ANTONIO PEREIRA NETO
    3 1200013  1996 FERNANDO JOSE DA COSTA
    4 1200013  1996 PAULO CEZAR FERREIRA DE ARAUJO
    5 1200013  2000 PAULO CESAR FERREIRA DE ARAUJO
    6 1200013  2000 SEBASTIAO BOCALOM RODRIGUES
    7 1200013  2004 JOAO DE ALMEIDA
    8 1200013  2004 PAULO CESAR FERREIRA DE ARAUJO

I would like to check, for each city separately, whether any candidates appear in more than one year. For example, in the sample above, the candidate

PAULO CEZAR FERREIRA DE ARAUJO

PAULO CESAR FERREIRA DE ARAUJO

appears under two different spellings. Every candidate in the data set should end up with a unique numeric identifier. The data set is quite large (5500 cities, about 100 thousand records), so reasonably efficient code would be useful. Any suggestions on how to implement this?

EDIT: Here is my attempt (with the help of comments so far), which is very slow (inefficient) in achieving the task. Any suggestions for improving this?

    f <- function(x) {
      matches <- lapply(levels(x), agrep, x=levels(x), fixed=TRUE, value=FALSE)
      levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
      x
    }
    temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
    df$candidatenew <- unlist(temp)
    df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)

EDIT 2: It now runs at a good speed. The problem was comparing against many factor levels at every step (thanks for pointing this out, Blue Master). Restricting the comparison to the candidates within one group (i.e., one city) runs the command in 5 seconds for 80,000 rows, a speed I can live with.

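    df$candidate <- as.character(df$candidate)
    f <- function(x) {
      # re-factor inside the group so only this city's names are levels
      x <- as.factor(x)
      matches <- lapply(levels(x), agrep, x=levels(x), fixed=TRUE, value=FALSE)
      # replace each level with the first level it fuzzily matches
      levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
      as.character(x)
    }
    temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
    # note: unlist(temp) is grouped by citycode, so df must already be sorted by citycode
    df$candidatenew <- unlist(temp)
    df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)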
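The code above yields corrected names but not yet the numeric identifier the question asks for. A minimal sketch of that last step, assuming the same name in two different cities should count as two different people; the second city code `1200021` and the column name `candidate_id` are made up for illustration:

```r
# Toy stand-in for the corrected data (second city code is hypothetical)
df <- data.frame(
  citycode     = c("1200013", "1200013", "1200013", "1200021"),
  candidatenew = c("PAULO CEZAR FERREIRA DE ARAUJO", "JOAO DE ALMEIDA",
                   "PAULO CEZAR FERREIRA DE ARAUJO", "JOAO DE ALMEIDA"),
  stringsAsFactors = FALSE
)

# One key per (city, corrected name) pair; the id is the position of the
# key among the distinct keys, in order of first appearance
key <- paste(df$citycode, df$candidatenew, sep = "|")
df$candidate_id <- match(key, unique(key))

df$candidate_id  # 1 2 1 3
```

Dropping `citycode` from the key would instead give one identifier per name across all cities.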
2 answers

Here's my shot at it. It's probably not very efficient, but I think it will do the job. I assume df$candidate is of class factor.

    #fuzzy matches candidate names to other candidate names
    #compares each pair of names only once
    ##by looking at names that have a greater index
    matches <- unlist(lapply(1:(length(levels(df[["candidate"]]))-1),
      function(x) {max(x, x + agrep(
        pattern=levels(df[["candidate"]])[x],
        x=levels(df[["candidate"]])[-seq_len(x)]
      ))}
    ))
    #assigns new levels (omits the last level because that doesn't change)
    levels(df[["candidate"]])[-length(levels(df[["candidate"]]))] <- levels(df[["candidate"]])[matches]

Well, given that the focus is on efficiency, I would suggest the following.

First, in terms of efficiency, we can predict from first principles that exact matching will be much faster than grep, which in turn will be faster than fuzzy grep. So match exactly first, then run fuzzy grep only on the observations that are left.
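A minimal sketch of that two-pass idea, assuming a vector of already-known names to match against (`known` and `incoming` are made-up names for illustration, and agrep() is left at its default tolerance):

```r
known    <- c("PAULO CESAR FERREIRA DE ARAUJO", "JOAO DE ALMEIDA")
incoming <- c("JOAO DE ALMEIDA", "PAULO CEZAR FERREIRA DE ARAUJO",
              "ANTONIO PEREIRA NETO")

# Pass 1: exact matching, vectorized; NA where no exact match exists
idx <- match(incoming, known)

# Pass 2: fuzzy matching only for the names the exact pass missed
todo <- which(is.na(idx))
idx[todo] <- vapply(incoming[todo], function(nm) {
  hits <- agrep(nm, known, fixed = TRUE)
  if (length(hits)) hits[1] else NA_integer_  # first fuzzy hit, else still NA
}, integer(1), USE.NAMES = FALSE)

idx  # 2 1 NA
```

Names that are still NA after both passes are new candidates rather than misspellings of known ones.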

Second, vectorize and avoid loops. The apply commands are not necessarily faster, so stick with built-in vectorization where you can. All the grep commands are natively vectorized, but it will be hard to avoid a *ply or a loop when each element has to be compared against the vector of all the others.

Third, use outside information to narrow the problem down. For example, do the fuzzy name matching only within each city or state, which will significantly reduce the number of comparisons that have to be made.

You can combine the first and third principles: you could even try an exact match on the first character of each string, and then fuzzy matching within that.
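Roughly, that combination could look like the sketch below, run on the sample rows from the question. The helper name `canonicalise` is made up, agrep() is left at its default tolerance, and "first fuzzy hit wins" is just one possible tie-breaking rule:

```r
# Sample rows from the question (character columns, not factors)
df <- data.frame(
  citycode  = rep("1200013", 8),
  year      = c("1996", "1996", "1996", "1996", "2000", "2000", "2004", "2004"),
  candidate = c("AGUSTINHO FORTUNATO FILHO", "ANTONIO PEREIRA NETO",
                "FERNANDO JOSE DA COSTA", "PAULO CEZAR FERREIRA DE ARAUJO",
                "PAULO CESAR FERREIRA DE ARAUJO", "SEBASTIAO BOCALOM RODRIGUES",
                "JOAO DE ALMEIDA", "PAULO CESAR FERREIRA DE ARAUJO"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: within one block, map every name to the first
# distinct name that agrep() reports as an approximate match
canonicalise <- function(nms) {
  uniq <- unique(nms)  # exact duplicates collapse here, before any fuzzy work
  canon <- vapply(uniq, function(nm) {
    uniq[agrep(nm, uniq, fixed = TRUE)[1]]  # a name always matches itself
  }, character(1), USE.NAMES = FALSE)
  canon[match(nms, uniq)]  # expand back to one value per input row
}

# Block = city plus first letter of the name (a cheap exact comparison)
block <- paste(df$citycode, substr(df$candidate, 1, 1))
df$candidatenew <- ave(df$candidate, block, FUN = canonicalise)
```

The price of blocking on the first letter is that a typo in the first character itself will never be caught, so it is a speed-for-recall trade-off.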
