Vectorize join condition in pandas

This code works as expected, but it takes a long time on large data frames.

    from difflib import SequenceMatcher

    for i in excel_df['name_of_college_school']:
        for y in mysql_df['college_name']:
            if SequenceMatcher(None, i.lower(), y.lower()).ratio() > 0.8:
                excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y

I don't think I can use a function in the join clause to compare values like this. How can I vectorize it?


Update:

Is it possible to keep only the highest-scoring match? This loop overwrites any previous match, and an earlier match may have been more relevant than the current one.

2 answers

What you are looking for is a fuzzy merge.

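    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    a = excel_df.to_numpy()
    b = mysql_df.to_numpy()
    for i in a:
        for j in b:
            if SequenceMatcher(None, i[college_index_a].lower(),
                               j[college_index_b].lower()).ratio() > 0.8:
                i[dupmark_index] = j[college_index_b]
    # to_numpy() may return a copy, so write the results back
    excel_df['dupmark4'] = a[:, dupmark_index]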

Avoid calling .loc inside a loop; it carries a lot of overhead. The index variables above are the numeric positions of the corresponding columns, which you can get with:

    df.columns.get_loc("college_name")
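As a minimal sketch of how the column positions used above could be obtained: the sample frame and its `id` column are made up for illustration, and only the `college_name` column and the `college_index_b` variable come from the thread.

```python
import pandas as pd

# Hypothetical frame; only the college_name column follows the question.
mysql_df = pd.DataFrame({
    "id": [1, 2],
    "college_name": ["MIT", "Stanford University"],
})

# get_loc returns the integer position of a column label.
college_index_b = mysql_df.columns.get_loc("college_name")

# The same position indexes each row of the NumPy array.
rows = mysql_df.to_numpy()
names = [row[college_index_b] for row in rows]
```

Here `college_index_b` is the position of `college_name` among the columns, so `row[college_index_b]` reads the college name from each array row without going through .loc.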

You can avoid one of the loops by using apply; instead of M×N .loc operations, there will now be only M.

    for y in mysql_df['college_name']:
        match = excel_df['name_of_college_school'].apply(
            lambda x: SequenceMatcher(None, x.lower(), y.lower()).ratio() > 0.8)
        excel_df.loc[match, 'dupmark4'] = y
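Regarding the update in the question: the loop above still lets a later match overwrite a better earlier one. One possible way to keep only the highest-scoring candidate is `difflib.get_close_matches` with `n=1`, sketched below. This is not a true vectorization and not something either answer proposed. The sample data and the `best_match` helper are hypothetical, and the cutoff is inclusive (`>= 0.8`) rather than the strict `> 0.8` used in the question.

```python
from difflib import get_close_matches

import pandas as pd

# Hypothetical sample data standing in for the frames in the question.
excel_df = pd.DataFrame(
    {"name_of_college_school": ["Stanford Univ", "MIT", "Harvrd University"]})
mysql_df = pd.DataFrame(
    {"college_name": ["Stanford University", "Harvard University",
                      "Harvard Universe"]})

# Map lowercased candidate names back to their original spelling.
candidates = {name.lower(): name
              for name in mysql_df["college_name"].dropna().unique()}
candidate_keys = list(candidates)


def best_match(name, cutoff=0.8):
    """Return the candidate with the highest ratio at or above cutoff, else None."""
    hits = get_close_matches(name.lower(), candidate_keys, n=1, cutoff=cutoff)
    return candidates[hits[0]] if hits else None


# Score each distinct name once, then broadcast the result with map.
unique_names = excel_df["name_of_college_school"].dropna().unique()
lookup = {name: best_match(name) for name in unique_names}
excel_df["dupmark4"] = excel_df["name_of_college_school"].map(lookup)
```

In this sample, "Harvrd University" clears the cutoff against both Harvard candidates, and only the closer one is kept. Names without any match at or above the cutoff end up as missing values in `dupmark4`. Scoring each distinct name only once also avoids repeated comparisons when the same name appears on many rows.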