How to remove one pandas DataFrame from another

Question


How can I remove one pandas DataFrame from another, in the same way as list subtraction:

a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]

Now we have two pandas DataFrames. How do we remove df2 from df1?

 In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])

 In [6]: df1
 Out[6]:
    a  b
 0  1  2
 1  3  4
 2  5  6

 In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])

 In [10]: df2
 Out[10]:
    a  b
 0  1  2
 1  5  6

Then we expect the result of df1-df2 to be:

 In [14]: df
 Out[14]:
    a  b
 0  3  4

How to do it?

Thanks.


python pandas dataframe subtraction

176coding May 19 '16 at 3:54


8 answers

piRSquared · Answer 1 · 2016-05-19T04:27:47+0000

Solution

Use pd.concat and then drop_duplicates(keep=False)

 pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

The result looks like:

    a  b
 1  3  4

Explanation

pd.concat joins the DataFrames by appending one right after the other. If there is any overlap, the drop_duplicates method catches it. However, drop_duplicates by default keeps the first occurrence and removes every other occurrence. In this case, we want every duplicate removed. Hence the keep=False parameter, which does exactly that.

A special note on the repeated df2. With only one df2, any row in df2 that is not in df1 will not be considered a duplicate and will remain. The version with a single df2 works only when df2 is a subset of df1. However, if we concatenate df2 twice, each of its rows is guaranteed to be a duplicate and is subsequently removed.
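The effect of repeating df2 can be sketched as follows. This is an illustration only: the df2 used here is hypothetical (it contains a row, (7, 8), that is not in df1) and is not the df2 from the question.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
# Hypothetical df2: (7, 8) does not occur in df1, so df2 is not a subset of df1.
df2 = pd.DataFrame([[1, 2], [7, 8]], columns=['a', 'b'])

# One copy of df2: (7, 8) occurs only once, so it survives.
once = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Two copies of df2: every df2 row is now a duplicate and is dropped.
twice = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

print(once.values.tolist())   # [[3, 4], [5, 6], [7, 8]]
print(twice.values.tolist())  # [[3, 4], [5, 6]]
```

Note that keep=False also removes rows that are repeated within df1 itself, so this approach assumes df1 has no duplicate rows that should be kept.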

Stefan · Answer 2 · 2016-05-19T19:04:56+0000

You can use .duplicated, which has the benefit of being fairly expressive:

 %%timeit
 combined = df1.append(df2)
 combined[~combined.index.duplicated(keep=False)]

 1000 loops, best of 3: 875 µs per loop

For comparison:

 %timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']

 100 loops, best of 3: 4.57 ms per loop

 %timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

 1000 loops, best of 3: 987 µs per loop

 %timeit df2[df2.apply(lambda x: x.value not in df2.values, axis=1)]

 1000 loops, best of 3: 546 µs per loop

In these timings, the np.array comparison is the fastest. The .tolist() call is not needed there.
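For readers on current pandas, here is a hedged variant of the .duplicated idea. DataFrame.append was removed in pandas 2.0, so pd.concat is used instead, and DataFrame.duplicated is applied to the row values rather than to the index labels. This is a sketch, not the code that was timed above.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Stack the frames; repeating df2 guarantees each of its rows is a duplicate.
combined = pd.concat([df1, df2, df2])

# duplicated(keep=False) marks every row whose values occur more than once.
result = combined[~combined.duplicated(keep=False)]
print(result.values.tolist())  # [[3, 4]]
```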

piRSquared · Answer 3 · 2016-05-19T08:32:39+0000

A set logic approach. Turn the rows of df1 and df2 into sets. Then use set subtraction to define a new DataFrame:

 idx1 = set(df1.set_index(['a', 'b']).index)
 idx2 = set(df2.set_index(['a', 'b']).index)

 pd.DataFrame(list(idx1 - idx2), columns=df1.columns)

    a  b
 0  3  4
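The same set subtraction can also be sketched without Python sets, assuming both frames have identical columns, by using pandas' own index operations (MultiIndex.from_frame and Index.difference):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Treat each row as one tuple-valued index entry.
rows1 = pd.MultiIndex.from_frame(df1)
rows2 = pd.MultiIndex.from_frame(df2)

# difference keeps the entries of rows1 that are absent from rows2.
result = rows1.difference(rows2).to_frame(index=False)
print(result.values.tolist())  # [[3, 4]]
```

Like the set version, this returns unique rows only, and Index.difference sorts the result by default, so the original row order of df1 is not necessarily preserved.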

knagaev · Answer 4 · 2016-05-19T09:43:44+0000

My attempt, merging df1 and df2 from the question.

Using the indicator parameter:

 In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
 Out[74]:
    a  b
 1  3  4
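As a minimal sketch, the merge with indicator=True can be wrapped in a small helper. The name anti_join is hypothetical, and the helper assumes both frames share the same columns, so that merge joins on all of them.

```python
import pandas as pd

def anti_join(left, right):
    """Return the rows of `left` that have no exact match in `right`."""
    # Dropping duplicates from `right` keeps the left join from repeating rows of `left`.
    merged = left.merge(right.drop_duplicates(), how='left', indicator=True)
    # '_merge' is 'left_only' for rows that were found only in `left`.
    return merged.loc[merged['_merge'] == 'left_only', left.columns]

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

print(anti_join(df1, df2).values.tolist())  # [[3, 4]]
```

Filtering the merged frame directly, rather than indexing df1 with the mask, avoids relying on the merged result having the same index as df1.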

piRSquared · Answer 5 · 2016-05-19T08:43:31+0000

Masking approach

 df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]

    a  b
 1  3  4
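A vectorized version of the same mask can be sketched with NumPy broadcasting instead of a row-wise apply. It assumes both frames have the same columns in the same order.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Compare every row of df1 with every row of df2 via broadcasting:
# (len(df1), 1, ncols) == (len(df2), ncols) -> (len(df1), len(df2), ncols)
matches = (df1.values[:, None, :] == df2.values).all(axis=2)

# A df1 row is removed if it equals at least one df2 row.
result = df1[~matches.any(axis=1)]
print(result.values.tolist())  # [[3, 4]]
```

The intermediate array has len(df1) × len(df2) × ncols elements, so this trades memory for speed and suits small to medium frames.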

Peter Abdou · Answer 6 · 2018-10-25T11:40:37+0000

I think the first tolist() should be removed, but the second one kept:

 df1[df1.apply(lambda x: x.values() not in df2.values.tolist(), axis=1)]

frozen shine · Answer 7 · 2018-11-14T21:17:33+0000

The easiest option is to use indexes.

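Concatenate df1 and df2 and reset the index:
 df = pd.concat([df1, df2])
 # drop=True discards the old index instead of adding it as a column
 df.reset_index(drop=True, inplace=True)
For example, this gives the indices of the rows that match df2, and then the remaining rows:
 in_df2 = df["a"].isin(df2["a"]) & df["b"].isin(df2["b"])
 indexes_df2 = df.index[in_df2]
 result_index = df.index[~in_df2]
 result_data = df.iloc[result_index, :]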

I hope this helps new readers, although the question was posted a while back :)

Eva vW · Answer 8 · 2019-07-11T20:37:59+0000

You can use a combination of concat and drop_duplicates

 df1 = pd.concat([df1, df2], sort=False).drop_duplicates(keep=False)

Combine the frames with concat, then discard every overlapping row by setting keep=False.
sort=False is only included to prevent future pandas warnings.
