For two python lists of the same length. How to return the best matches of the same values?

Below are two python lists with lines in them (people names):

list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen'] list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen'] 

I need to match the names that are most similar.

 'J. Payne' -> 'John Payne' 'George Bush' -> 'George W. Bush' 'Billy Idol' -> 'Billy Idol' 'M Stuart' -> 'M. Stuart' 'Luc van den Bergen' -> 'Luc Bergen' 

Is there a neat way to do this in python? Lists contain an average of 5 or 6 names. Sometimes more, but it’s rare. Sometimes this is just one name on each list, which can be spelled slightly differently.

+7
python string list mapping
source share
3 answers

You can try difflib :

 import difflib list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen'] list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen'] mymap = {} for elem in list_1: closest = difflib.get_close_matches(elem, list_2) if closest: mymap[elem] = closest[0] print mymap 

exit:

 {'George Bush': 'George W. Bush', 'Luc van den Bergen': 'Luc Bergen', 'Billy Idol': 'Billy Idol', 'J. Payne': 'John Payne', 'M Stuart': 'M. Stuart'} 
+10
source share

Using the function defined here: http://hetland.org/coding/python/levenshtein.py

 >>> for i in list_1: ... print i, '==>', min(list_2, key=lambda j:levenshtein(i,j)) ... 
 J. Payne ==> John Payne
 George Bush ==> George W. Bush
 Billy Idol ==> Billy Idol
 M Stuart ==> M. Stuart
 Luc van den Bergen ==> Luc Bergen

You can use functools.partial instead of lambda

 >>> from functools import partial >>> for i in list_1: ... print i, '==>', min(list_2, key=partial(levenshtein,i)) ... 
 J. Payne ==> John Payne
 George Bush ==> George W. Bush
 Billy Idol ==> Billy Idol
 M Stuart ==> M. Stuart
 Luc van den Bergen ==> Luc Bergen
+11
source share

Here is a variant of these solutions, which also optimizes the global minimum distance. It uses the Munkres algorithm to ensure optimized line pairs.

 from munkres import Munkres def match_lists(l1, l2): # Compute a matrix of string distances for all combinations of # items in l1 and l2. matrix = [[levenshtein(i1, i2) for i2 in l2] for i1 in l1] # Now figure out what the global minimum distance between the # pairs is. indexes = Munkres().compute(matrix) for row, col in indexes: yield l1[row], l2[col] l1 = [ 'bolton', 'manchester city', 'manchester united', 'wolves', 'liverpool', 'sunderland', 'wigan', 'norwich', 'arsenal', 'aston villa', 'chelsea', 'fulham', 'newcastle utd', 'stoke city', 'everton', 'tottenham', 'blackburn', 'west brom', 'qpr', 'swansea' ] l2 = [ 'bolton wanderers', 'manchester city', 'manchester united', 'wolverhampton', 'liverpool', 'norwich city', 'sunderland', 'wigan athletic', 'arsenal', 'aston villa', 'chelsea', 'fulham', 'newcastle united', 'stoke city', 'everton', 'tottenham hotspur', 'blackburn rovers', 'west bromwich', 'queens park rangers', 'swansea city' ] for i1, i2 in match_lists(l1, l2): print i1, '=>', i2 

For the above lists, where the differences are more related to the alternative spelling and nicknames rather than spelling errors, this method gives better results than just using levenshtein or difflib. The munkres module can be found here: http://software.clapper.org/munkres/

+2
source share

All Articles