Apply fuzzy match on dataframe column and save the results in a new column

I have two data frames, each of which has a different number of rows. Below is a couple of rows from each dataset.

df1 = Company City State ZIP FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 LACKEY SHEET METAL St. Louis MO 63102 

and

 df2 = FDA Company FDA City FDA State FDA ZIP LACKEY SHEET METAL St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530 HELGET GAS PRODUCTS INC Omaha NE 68127 ORTHOQUEST LLC La Vista NE 68128 

I attached them side by side using combined_data = pandas.concat([df1, df2], axis = 1) . My next goal is to compare each line under df1['Company'] with each line in df2 df2['FDA Company'] using several different matching commands from the fuzzy wuzzy module and return the best match value and its name. I want to save this in a new column. For example, if I made fuzz.ratio and fuzz.token_sort_ratio on the LACKY SHEET METAL in df1['Company'] to df2['FDA Company'] , it would return that the best match was LACKY SHEET METAL with a score of 100 , and this then it will be saved to a new column in combined data . Results will look like

 combined_data = Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100 CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530 GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127 LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128 

I tried to do

 combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1) 

But an error was received because the lengths of the columns are different.

I'm at a dead end. How can i do this?

+5
source share
1 answer

I could not say what you were doing. This is how I do it.

 from fuzzywuzzy import fuzz from fuzzywuzzy import process 

Create a series of tuples for comparison:

 compare = pd.MultiIndex.from_product([df1['Company'], df2['FDA Company']]).to_series() 

Create a special function to calculate fuzzy metrics and return a series.

 def metrics(tup): return pd.Series([fuzz.ratio(*tup), fuzz.token_sort_ratio(*tup)], ['ratio', 'token']) 

Apply metrics to compare series

 compare.apply(metrics) 

enter image description here

There are several ways to do the following part:

Get the closest matches to each line df1

 compare.apply(metrics).unstack().idxmax().unstack(0) 

enter image description here

Get the closest matches to each df2 line

 compare.apply(metrics).unstack(0).idxmax().unstack(0) 

enter image description here

+7
source

All Articles