Python Fuzzy Matching (FuzzyWuzzy) - Keep Only the Best Match

I am trying to fuzzy-match two CSV files, each of which contains one column of names that are similar to each other but not identical.

My code so far looks like this:

    import pandas as pd
    from pandas import DataFrame
    from fuzzywuzzy import process
    import csv

    save_file = open('fuzzy_match_results.csv', 'w')
    writer = csv.writer(save_file, lineterminator = '\n')

    def parse_csv(path):
        with open(path,'r') as f:
            reader = csv.reader(f, delimiter=',')
            for row in reader:
                yield row

    if __name__ == "__main__":
        ## Create lookup dictionary by parsing the products csv
        data = {}
        for row in parse_csv('names_1.csv'):
            data[row[0]] = row[0]

        ## For each row in the lookup compute the partial ratio
        for row in parse_csv("names_2.csv"):
            #print(process.extract(row,data, limit = 100))
            for found, score, matchrow in process.extract(row, data, limit=100):
                if score >= 60:
                    print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
                    Digi_Results = [row, score, found]
                    writer.writerow(Digi_Results)

        save_file.close()

The output is as follows:

    Name11 , 90 , Name25
    Name11 , 85 , Name24
    Name11 , 65 , Name29

The script works fine and the output is as expected, but I only want the best match for each name:

    Name11 , 90 , Name25
    Name12 , 95 , Name21
    Name13 , 98 , Name22

So I need to somehow drop the duplicate names in column 1, keeping only the row with the highest score in column 2. This should be pretty simple, but I can't figure it out. Any help would be appreciated.

3 answers

fuzzywuzzy's process.extract() returns its list sorted by score in descending order, so the best match comes first.

To keep only the best match, set the limit argument to 1 so that process.extract() returns a single result. If its score is at least 60, write it to the CSV just as you do now.

Example -

    ## For each row in the lookup compute the partial ratio
    for row in parse_csv("names_2.csv"):
        for found, score, matchrow in process.extract(row, data, limit=1):
            if score >= 60:
                print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
                Digi_Results = [row, score, found]
                writer.writerow(Digi_Results)
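If you would rather keep the limit=100 call and filter afterwards, which is closer to what the question describes, the duplicates can be collapsed in plain Python. This is only a minimal sketch: keep_best is a hypothetical helper name, and the sample rows are the values shown in the question, not real results.

```python
def keep_best(rows):
    """Return one (name, score, match) row per name: the highest-scoring one."""
    best = {}
    for name, score, match in rows:
        # Replace the stored row only when this one scores strictly higher,
        # so the first row wins on a tie.
        if name not in best or score > best[name][1]:
            best[name] = (name, score, match)
    # Dicts preserve insertion order (Python 3.7+), so names stay in
    # first-seen order.
    return list(best.values())


# Sample rows copied from the question's output (illustrative only).
rows = [
    ("Name11", 90, "Name25"),
    ("Name11", 85, "Name24"),
    ("Name11", 65, "Name29"),
    ("Name12", 95, "Name21"),
    ("Name13", 98, "Name22"),
]

for row in keep_best(rows):
    print(row)
```

In the original script this would mean collecting the rows in a list instead of writing them immediately, then passing the filtered list to writer.writerows().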

Several parts of the code can be greatly simplified with process.extractOne() from FuzzyWuzzy. Not only does it return just the top match, you can also set a score threshold in the function call instead of performing a separate check, for example:

 process.extractOne(row, data, score_cutoff = 60) 

This function returns a tuple containing the best match and its score (followed by the key when the choices are a dict, as they are here) if a match meets the cutoff. Otherwise it returns None.
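Put together, the matching loop might look roughly like the sketch below. The helper name best_match_rows and its extract_one parameter are assumptions made for illustration; by default it calls process.extractOne, and it assumes each query is a plain string (for example row[0] rather than the whole CSV row).

```python
def best_match_rows(names, choices, cutoff=60, extract_one=None):
    """Return [name, score, match] for each name whose best match meets the cutoff."""
    if extract_one is None:
        # Imported lazily so the helper can also be driven by a stand-in matcher.
        from fuzzywuzzy import process
        extract_one = process.extractOne

    rows = []
    for name in names:
        best = extract_one(name, choices, score_cutoff=cutoff)
        if best is None:
            # No candidate reached the cutoff, so this name is skipped.
            continue
        # Index rather than unpack: the tuple has two items for a list of
        # choices and three (match, score, key) for a dict.
        rows.append([name, best[1], best[0]])
    return rows
```

The returned rows are in the same [name, score, match] order as the original script, so they can be passed straight to writer.writerows().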


I just wrote the same thing for myself, but in pandas ....

    import pandas as pd
    import numpy as np
    from fuzzywuzzy import fuzz
    from fuzzywuzzy import process

    d1={1:'Tim','2':'Ted',3:'Sally',4:'Dick',5:'Ethel'}
    d2={1:'Tam','2':'Tid',3:'Sally',4:'Dicky',5:'Aardvark'}

    df1=pd.DataFrame.from_dict(d1,orient='index')
    df2=pd.DataFrame.from_dict(d2,orient='index')

    df1.columns=['Name']
    df2.columns=['Name']

    def match(Col1,Col2):
        overall=[]
        for n in Col1:
            result=[(fuzz.partial_ratio(n, n2),n2)
                    for n2 in Col2 if fuzz.partial_ratio(n, n2)>50
                   ]
            if len(result):
                result.sort()
                print('result {}'.format(result))
                print("Best M={}".format(result[-1][1]))
                overall.append(result[-1][1])
            else:
                overall.append(" ")
        return overall

    print(match(df1.Name,df2.Name))

Here I used a threshold of 50, but it is configurable.

Dataframe1 looks like

       Name
    1  Tim
    2  Ted
    3  Sally
    4  Dick
    5  Ethel

And Dataframe2 looks like

       Name
    1  Tam
    2  Tid
    3  Sally
    4  Dicky
    5  Aardvark

Running this gives the result

 ['Tid', 'Tid', 'Sally', 'Dicky', ' '] 

Hope this helps.
