How to remove one pandas DataFrame from another

How can I remove one pandas DataFrame from another, just like set subtraction:

    a = [1, 2, 3, 4, 5]
    b = [1, 5]
    a - b = [2, 3, 4]

And now we have two pandas DataFrames. How do I remove df2 from df1?

    In [5]: df1 = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['a','b'])

    In [6]: df1
    Out[6]:
       a  b
    0  1  2
    1  3  4
    2  5  6

    In [9]: df2 = pd.DataFrame([[1,2],[5,6]], columns=['a','b'])

    In [10]: df2
    Out[10]:
       a  b
    0  1  2
    1  5  6

Then we expect the result of df1-df2 to be:

    In [14]: df
    Out[14]:
       a  b
    0  3  4

How to do it?

Thanks.

+17
8 answers

Solution

Use pd.concat followed by drop_duplicates(keep=False):

 pd.concat([df1, df2, df2]).drop_duplicates(keep=False) 

This looks like:

       a  b
    1  3  4

Explanation

pd.concat joins the DataFrames by appending one right after the other. If there is any overlap, it is caught by the drop_duplicates method. However, drop_duplicates by default keeps the first occurrence and removes every other one. In this case, we want every duplicate removed. Hence the keep=False parameter, which does exactly that.

A special note on the repeated df2. With only one df2, any row in df2 that is not in df1 won't be considered a duplicate and will remain. The solution with a single df2 therefore works only when df2 is a subset of df1. However, if we concatenate df2 twice, every one of its rows is guaranteed to be a duplicate and will subsequently be removed.
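
To make the note about the repeated df2 concrete, here is a minimal sketch. It reuses df1 from the question and adds a hypothetical df3 (not part of the original post) that contains one row missing from df1.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
# df3 is a hypothetical frame for illustration: (7, 8) does not occur in df1
df3 = pd.DataFrame([[1, 2], [7, 8]], columns=['a', 'b'])

# One copy: (7, 8) appears only once, so it is not a duplicate and survives
once = pd.concat([df1, df3]).drop_duplicates(keep=False)

# Two copies: every row of df3 is now duplicated and gets dropped
twice = pd.concat([df1, df3, df3]).drop_duplicates(keep=False)

print(once.values.tolist())   # [[3, 4], [5, 6], [7, 8]]
print(twice.values.tolist())  # [[3, 4], [5, 6]]
```

One caveat worth keeping in mind: because keep=False drops every repeated row, rows that are duplicated within df1 itself are removed as well.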

+30

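You can use .duplicated, which has the benefit of being fairly expressive. Note that it compares index labels rather than row values, so it only works when matching rows carry the same index label in both frames:

    %%timeit
    # DataFrame.append was removed in pandas 2.0; there, use pd.concat([df1, df2])
    combined = df1.append(df2)
    combined[~combined.index.duplicated(keep=False)]

    1000 loops, best of 3: 875 µs per loop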

For comparison:

    %timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
    100 loops, best of 3: 4.57 ms per loop

    %timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
    1000 loops, best of 3: 987 µs per loop

    %timeit df1[df1.apply(lambda x: x.values not in df2.values, axis=1)]
    1000 loops, best of 3: 546 µs per loop

Comparing the np.array values directly is the fastest; the .tolist() calls are not needed.
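
A word of caution on the array comparison, shown as a small sketch: the `in` operator on a 2-D NumPy array succeeds as soon as a single cell matches, so it is not a strict row-membership test. The names `rows` and `probe` are made up for illustration.

```python
import numpy as np

rows = np.array([[1, 2], [5, 6]])   # stands in for df2.values
probe = np.array([1, 6])            # hypothetical row, not present in `rows`

# `in` on an ndarray is evaluated as (rows == probe).any(): one equal cell is enough
loose = probe in rows

# Strict row membership: all cells of some row must match
strict = bool((rows == probe).all(axis=1).any())

print(loose, strict)  # True False
```

With the data from the question both tests agree, but on frames that share individual values across different rows the faster variant may drop rows it should keep.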

+6

A set logic approach. Turn the rows of df1 and df2 into sets, then use set subtraction to define a new DataFrame:

    idx1 = set(df1.set_index(['a', 'b']).index)
    idx2 = set(df2.set_index(['a', 'b']).index)

    pd.DataFrame(list(idx1 - idx2), columns=df1.columns)

       a  b
    0  3  4
+2

My shot at it: merge df1 and df2 from the question.

Using the indicator parameter

    In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
    Out[74]:
       a  b
    1  3  4
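
The one-liner above relies on df1 having the default 0..n-1 index and on df2 containing no repeated rows, since the mask is built from the merged frame. A slightly more defensive sketch is given below; `anti_join` is a hypothetical helper name, and it assumes both frames have the same columns.

```python
import pandas as pd

def anti_join(left, right):
    """Rows of `left` with no exact match in `right` (hypothetical helper)."""
    # De-duplicating `right` keeps the left merge at one output row per left row
    merged = left.merge(right.drop_duplicates(), how='left',
                        on=list(left.columns), indicator=True)
    # Use a positional mask so the original index of `left` does not matter
    keep = (merged['_merge'] == 'left_only').to_numpy()
    return left[keep]

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

result = anti_join(df1, df2)
print(result.values.tolist())  # [[3, 4]]
```

The surviving row keeps its original index label (1 here), matching the output shown in the answer.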
+2

Masking approach

    df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]

       a  b
    1  3  4
+1

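I think the first tolist() could be removed while keeping the second (values is an attribute, so it takes no parentheses):

    # Caveat: NumPy raises ValueError (ambiguous truth value) when an ndarray
    # with more than one element is tested against a list of lists, so keep
    # x.values.tolist() if you hit that.
    df1[df1.apply(lambda x: x.values not in df2.values.tolist(), axis=1)]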
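
As an alternative to juggling tolist() calls, the same mask can be sketched with plain tuples, which are hashable and compare unambiguously. The names `seen` and `mask` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Each row of df2 as a plain tuple, collected in a set for fast lookups
seen = set(df2.itertuples(index=False, name=None))

# True for the rows of df1 that do not occur in df2
mask = [row not in seen for row in df1.itertuples(index=False, name=None)]

result = df1[mask]
print(result.values.tolist())  # [[3, 4]]
```

This keeps the row order and index of df1, and avoids calling apply once per row.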
0

The easiest option is to use indexes.

  1. Concatenate df1 and df2 and reset the index.

    df = pd.concat([df1, df2])
    df.reset_index(drop=True, inplace=True)

  2. Find the indices of the rows that match df2, then keep the rest, e.g.:

    # indices of the rows whose values occur in df2
    indexes_df2 = df.index[(df["a"].isin(df2["a"])) & (df["b"].isin(df2["b"]))]
    # all remaining indices
    result_index = df.index.difference(indexes_df2)
    result_data = df.loc[result_index, :]
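
Note that the isin test above checks each column independently. The sketch below, which adds a hypothetical row (1, 6) to df1 for illustration, shows how that differs from comparing whole rows through a MultiIndex.

```python
import pandas as pd

# (1, 6) is added for illustration only; it is not a row of df2
df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [1, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Column-wise test: (1, 6) is flagged because 1 is in df2['a'] and 6 is in df2['b']
per_column = df1['a'].isin(df2['a']) & df1['b'].isin(df2['b'])

# Row-wise test: compare whole (a, b) pairs
pairs = list(df2.itertuples(index=False, name=None))
per_row = pd.MultiIndex.from_frame(df1).isin(pairs)

print(df1[~per_column].values.tolist())  # [[3, 4]]
print(df1[~per_row].values.tolist())     # [[3, 4], [1, 6]]
```

For the frames in the question both tests give the same answer; they diverge only when values from different df2 rows combine into a row that df2 does not contain.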

I hope this helps new readers, even though the question was posted a while back :)

0

You can use a combination of concat and drop_duplicates

 df1 = pd.concat([df1, df2], sort=False).drop_duplicates(keep=False) 

Combine the frames with concat, then discard all overlapping rows by setting keep=False.
sort=False is only included to keep pandas from emitting a future warning.

0