I have two data frames, each with a different number of rows. Below are a few rows from each dataset.
df1 =
Company                              City       State  ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE  St. Louis  MO     63101
CITYARCHRIVER 2015 FOUNDATION        St. Louis  MO     63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE  St. Louis  MO     63102
LACKEY SHEET METAL                   St. Louis  MO     63102
and
df2 =
FDA Company                    FDA City    FDA State  FDA ZIP
LACKEY SHEET METAL             St. Louis   MO         63102
PRIMUS STERILIZER COMPANY LLC  Great Bend  KS         67530
HELGET GAS PRODUCTS INC        Omaha       NE         68127
ORTHOQUEST LLC                 La Vista    NE         68128
I attached them side by side using combined_data = pandas.concat([df1, df2], axis = 1) . My next goal is to compare each value in df1['Company'] with every value in df2['FDA Company'] using several different scorers from the fuzzywuzzy module, and return the best match and its score. I want to save these in new columns. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL in df1['Company'] against df2['FDA Company'] , each would return LACKEY SHEET METAL as the best match with a score of 100 , and this would then be saved to new columns in combined_data . The result would look like
combined_data =
Company                              City       State  ZIP    FDA Company                    FDA City    FDA State  FDA ZIP  fuzzy.token_sort_ratio match  fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE  St. Louis  MO     63101  LACKEY SHEET METAL             St. Louis   MO         63102    LACKEY SHEET METAL 100        LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION        St. Louis  MO     63102  PRIMUS STERILIZER COMPANY LLC  Great Bend  KS         67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE  St. Louis  MO     63102  HELGET GAS PRODUCTS INC        Omaha       NE         68127
LACKEY SHEET METAL                   St. Louis  MO     63102  ORTHOQUEST LLC                 La Vista    NE         68128
I tried to do
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
But I got an error because the columns have different lengths.
I'm at a dead end. How can I do this?
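To make the goal concrete, here is a minimal sketch of the comparison described above: each company name is scored against the whole list of FDA names, so the two lists do not need to be the same length. It is only an illustration under assumptions. It uses difflib from the standard library as a rough stand-in for fuzz.ratio and fuzz.token_sort_ratio, so scores may differ from fuzzywuzzy's, and the helper names ratio, token_sort_ratio and best_match are hypothetical.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Similarity from 0 to 100; a rough stand-in for fuzz.ratio (assumption).
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort_ratio(a, b):
    # Lowercase and sort the whitespace-separated tokens before comparing;
    # a rough stand-in for fuzz.token_sort_ratio (assumption).
    def normalise(s):
        return " ".join(sorted(s.lower().split()))
    return ratio(normalise(a), normalise(b))

def best_match(name, choices, scorer):
    # Score `name` against every choice and return (best_choice, score).
    return max(((c, scorer(name, c)) for c in choices), key=lambda pair: pair[1])

# Sample values taken from the two data frames above.
companies = [
    "FREDDIE LEES AMERICAN GOURMET SAUCE",
    "CITYARCHRIVER 2015 FOUNDATION",
    "GLAXOSMITHKLINE CONSUMER HEALTHCARE",
    "LACKEY SHEET METAL",
]
fda_companies = [
    "LACKEY SHEET METAL",
    "PRIMUS STERILIZER COMPANY LLC",
    "HELGET GAS PRODUCTS INC",
    "ORTHOQUEST LLC",
]

# One (company, best FDA match, score) tuple per company, for each scorer.
ratio_matches = [(n, *best_match(n, fda_companies, ratio)) for n in companies]
token_matches = [(n, *best_match(n, fda_companies, token_sort_ratio)) for n in companies]

for row in ratio_matches:
    print(row)
```

With fuzzywuzzy installed, process.extractOne(name, choices, scorer=fuzz.token_sort_ratio) returns a similar (match, score) pair, and the same per-name lookup could presumably be applied to df1['Company'] alone rather than row-wise over combined_data.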