Vectorize join condition in pandas

Question

This code works as expected, but it takes a long time on large data frames.

    for i in excel_df['name_of_college_school']:
        for y in mysql_df['college_name']:
            if SequenceMatcher(None, i.lower(), y.lower()).ratio() > 0.8:
                excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y

I don't think I can use a function in a join clause to compare values like this. How can I vectorize it?

Update:

Is it possible to update with the highest-scoring match? This loop overwrites the previous match, and an earlier match may have been more relevant than the current one.

+7

pandas

shantanuo 18 sept '17 at 6:32

2 answers

siddharth iyer · Answer 1 · 2017-09-18T06:49:11+0000

What you are looking for is a fuzzy merge.

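    from difflib import SequenceMatcher

    # as_matrix() was removed in pandas 1.0; .values works in old and new versions
    a = excel_df.values
    b = mysql_df.values
    for i in a:
        for j in b:
            # compare the two college-name cells, ignoring case
            if SequenceMatcher(None, i[college_index_a].lower(), j[college_index_b].lower()).ratio() > 0.8:
                i[dupmark_index] = j[college_index_b]
    # .values may return a copy, so write the marks back to the frame
    excel_df.iloc[:, dupmark_index] = a[:, dupmark_index]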

Never use loc in a loop; it has huge overhead. Also, look up the numeric index of each column you need with:

    df.columns.get_loc("college name")
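As a minimal sketch of how those indices could be obtained, assuming a small made-up excel_df (the sample rows are hypothetical; the column names are taken from the question):

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
excel_df = pd.DataFrame({
    "id": [1, 2],
    "name_of_college_school": ["Fergusson College", "St. Xavier's College"],
    "dupmark4": [None, None],
})

# get_loc returns the integer position of a column label.
college_index_a = excel_df.columns.get_loc("name_of_college_school")
dupmark_index = excel_df.columns.get_loc("dupmark4")

# Rows of the underlying array can then be indexed by position.
rows = excel_df.values
first_name = rows[0][college_index_a]
```

The same call on mysql_df would give college_index_b.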

Zero · Answer 2 · 2017-09-18T06:50:13+0000

You can avoid one of the loops by using apply. Instead of one .loc operation per pair of names (M x N), there will now be only M, one per name in mysql_df.

    for y in mysql_df['college_name']:
        match = excel_df['name_of_college_school'].apply(
            lambda x: SequenceMatcher(None, x.lower(), y.lower()).ratio() > 0.8)
        excel_df.loc[match, 'dupmark4'] = y
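Regarding the update in the question, one possible way to keep the highest-scoring match rather than the last one is sketched below. It is not truly vectorized: it scores each unique name once and maps the result back onto the column. The helper best_match and the sample data are assumptions made for illustration, not part of either answer.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical sample data; only the column names come from the question.
excel_df = pd.DataFrame(
    {"name_of_college_school": ["modern college", "Unknown Institute"]})
mysql_df = pd.DataFrame(
    {"college_name": ["Modern College", "Modern Colleges"]})


def best_match(name, candidates, threshold=0.8):
    """Return the candidate with the highest ratio above threshold, else None."""
    best, best_score = None, threshold
    for cand in candidates:
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:  # only a strictly better score replaces the match
            best, best_score = cand, score
    return best


candidates = mysql_df["college_name"].tolist()

# Score each distinct name once, then map the results onto every row.
lookup = {name: best_match(name, candidates)
          for name in excel_df["name_of_college_school"].unique()}
excel_df["dupmark4"] = excel_df["name_of_college_school"].map(lookup)
```

With this sample data both candidates score above 0.8 for "modern college", so the original loop would end on "Modern Colleges", while the sketch keeps the exact match "Modern College".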
