How to subtract the rows of one pandas data frame from another?

The operation I want is similar to a merge. For example, an inner merge gives a data frame containing the rows that are present in both the first and the second data frame. An outer merge gives a data frame containing the rows that are in EITHER the first OR the second data frame.

I need a data frame containing the rows that are present in the first data frame and NOT present in the second. Is there a quick and elegant way to do this?

4 answers

How about the following:

    print(df1)
         Team  Year  foo
    0   Hawks  2001    5
    1   Hawks  2004    4
    2    Nets  1987    3
    3    Nets  1988    6
    4    Nets  2001    8
    5    Nets  2000   10
    6    Heat  2004    6
    7  Pacers  2003   12

    print(df2)
         Team  Year  foo
    0  Pacers  2003   12
    1    Heat  2004    6
    2    Nets  1988    6

As long as there is a commonly named column that is not one of the merge keys, you can let the suffixes that merge adds do the work (if there is no such non-key common column, you can create one for temporary use, e.g. df1['common'] = 1 and df2['common'] = 1):

    new = df1.merge(df2, on=['Team', 'Year'], how='left')
    print(new[new.foo_y.isnull()])
        Team  Year  foo_x  foo_y
    0  Hawks  2001      5    NaN
    1  Hawks  2004      4    NaN
    2   Nets  1987      3    NaN
    4   Nets  2001      8    NaN
    5   Nets  2000     10    NaN
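
For anyone who wants to run the merge approach end to end, here is a minimal sketch that rebuilds the two example frames and then tidies the result. The final column selection and rename are my own additions, not part of the answer above, and the name only_in_df1 is just illustrative.

```python
import pandas as pd

# Rebuild the example frames shown above
df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

# Left merge on the key columns; the shared non-key column 'foo'
# is split into 'foo_x' (from df1) and 'foo_y' (from df2)
merged = df1.merge(df2, on=["Team", "Year"], how="left")

# Rows with no partner in df2 have NaN in 'foo_y'
only_in_df1 = merged[merged["foo_y"].isnull()]

# Optional cleanup (an addition of mine): restore the original column layout
result = only_in_df1[["Team", "Year", "foo_x"]].rename(columns={"foo_x": "foo"})
print(result)
```

Note that this relies on foo having no missing values in df2; otherwise a matched row could also show NaN in foo_y.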

Or you can use isin, but you will need to create a single key column:

    # Build one string key per row from the two key columns
    df1['key'] = df1['Team'] + df1['Year'].astype(str)
    df2['key'] = df2['Team'] + df2['Year'].astype(str)  # use df2's own Team column here
    print(df1[~df1.key.isin(df2.key)])
        Team  Year  foo        key
    0  Hawks  2001    5  Hawks2001
    1  Hawks  2004    4  Hawks2004
    2   Nets  1987    3   Nets1987
    4   Nets  2001    8   Nets2001
    5   Nets  2000   10   Nets2000
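
A concatenated string key can collide in principle (for example, a team called "Nets1" in year 987 would produce the same key as "Nets" in 1987). As a possible variation that is not from the answer above, the sketch below compares the key columns as tuples through a MultiIndex instead, which also avoids adding a helper column. The variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

keys = ["Team", "Year"]

# set_index returns a new frame, so df1 and df2 are left unchanged;
# each row's key becomes a (Team, Year) tuple in a MultiIndex
left_keys = df1.set_index(keys).index
right_keys = df2.set_index(keys).index

# Boolean array: True where the (Team, Year) pair also occurs in df2
in_df2 = left_keys.isin(right_keys)

result = df1[~in_df2]
print(result)
```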

Consider the following:

  • df_one is the first DataFrame
  • df_two is the second DataFrame

Rows present in the first DataFrame and not in the second DataFrame:

Solution, by index: df = df_one[~df_one.index.isin(df_two.index)]

index can be replaced with whichever column you want to base the exclusion on. In the example above, I used the index as the reference between the two data frames.

Additionally, you can build a more complex query by using a boolean pandas.Series in place of the condition above.
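
To make the three variants concrete, here is a small sketch. The frames, the id and value columns, and the extra condition are all made-up illustration data, not something from the question.

```python
import pandas as pd

# Hypothetical example data; 'id' and 'value' are illustrative column names
df_one = pd.DataFrame(
    {"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]},
    index=[10, 11, 12, 13],
)
df_two = pd.DataFrame(
    {"id": [2, 4], "value": ["b", "d"]},
    index=[11, 13],
)

# 1) Exclude by index: keep rows whose index label is not in df_two
by_index = df_one[~df_one.index.isin(df_two.index)]

# 2) Exclude by column: same idea, using 'id' as the reference
by_column = df_one[~df_one["id"].isin(df_two["id"])]

# 3) More complex query: combine the exclusion with another boolean Series.
#    The parentheses around the comparison are required because & binds
#    more tightly than != in Python.
mask = ~df_one["id"].isin(df_two["id"]) & (df_one["value"] != "a")
combined = df_one[mask]

print(by_index)
print(by_column)
print(combined)
```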


You may get wrong results if a non-key column has cells containing NaN.

    print(df1)
          Team  Year  foo
    0    Hawks  2001    5
    1    Hawks  2004    4
    2     Nets  1987    3
    3     Nets  1988    6
    4     Nets  2001    8
    5     Nets  2000   10
    6     Heat  2004    6
    7   Pacers  2003   12
    8  Problem  2112  NaN

    print(df2)
          Team  Year  foo
    0   Pacers  2003   12
    1     Heat  2004    6
    2     Nets  1988    6
    3  Problem  2112  NaN

    new = df1.merge(df2, on=['Team', 'Year'], how='left')
    print(new[new.foo_y.isnull()])
          Team  Year  foo_x  foo_y
    0    Hawks  2001      5    NaN
    1    Hawks  2004      4    NaN
    2     Nets  1987      3    NaN
    4     Nets  2001      8    NaN
    5     Nets  2000     10    NaN
    6  Problem  2112    NaN    NaN

The Problem team in 2112 has no value for foo in either table. Thus, the left join here will falsely report this row, which has a match in both DataFrames, as not present in the right DataFrame.

Solution:

What I do is add a marker column to the inner (right) DataFrame and set a value in it for all rows. Then, after the join, you can check whether this column is NaN to find the records that exist only in the outer (left) DataFrame.

    df1['in_df1'] = 'yes'  # implied by the in_df1 column in the output below
    df2['in_df2'] = 'yes'
    print(df2)
          Team  Year  foo in_df2
    0   Pacers  2003   12    yes
    1     Heat  2004    6    yes
    2     Nets  1988    6    yes
    3  Problem  2112  NaN    yes

    new = df1.merge(df2, on=['Team', 'Year'], how='left')
    print(new[new.in_df2.isnull()])
        Team  Year  foo_x  foo_y in_df1 in_df2
    0  Hawks  2001      5    NaN    yes    NaN
    1  Hawks  2004      4    NaN    yes    NaN
    2   Nets  1987      3    NaN    yes    NaN
    4   Nets  2001      8    NaN    yes    NaN
    5   Nets  2000     10    NaN    yes    NaN

NB. The Problem row is now correctly filtered out, because it has a value for in_df2.

  Problem 2112 NaN NaN yes yes 
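
Here is a compact, runnable sketch of the pitfall and the fix side by side, on a trimmed-down version of the data. Using assign to attach the marker column (so that df2 itself is not modified) is my own choice for the sketch; the answer above sets the column in place.

```python
import numpy as np
import pandas as pd

# Trimmed-down example data, including the row with a missing 'foo'
df1 = pd.DataFrame({
    "Team": ["Hawks", "Nets", "Heat", "Problem"],
    "Year": [2001, 1988, 2004, 2112],
    "foo": [5, 6, 6, np.nan],
})
df2 = pd.DataFrame({
    "Team": ["Heat", "Nets", "Problem"],
    "Year": [2004, 1988, 2112],
    "foo": [6, 6, np.nan],
})

# Naive check: NaN in 'foo_y' is taken to mean "no match in df2".
# The Problem row matches on (Team, Year), but its 'foo_y' is NaN anyway,
# so it is wrongly reported as missing from df2.
naive = df1.merge(df2, on=["Team", "Year"], how="left")
naive_result = naive[naive["foo_y"].isnull()]

# Fix: merge against a copy of df2 carrying a marker column that is never NaN
marked = df1.merge(df2.assign(in_df2="yes"), on=["Team", "Year"], how="left")
fixed_result = marked[marked["in_df2"].isnull()]

print(naive_result["Team"].tolist())
print(fixed_result["Team"].tolist())
```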

I suggest using the indicator parameter of merge. Also, if on is None, merge defaults to joining on the columns that the two data frames have in common.

    new = df1.merge(df2, how='left', indicator=True)  # adds a new column '_merge'
    new = new[(new['_merge'] == 'left_only')].copy()  # rows only in df1 and not df2
    new = new.drop(columns='_merge').copy()

        Team  Year  foo
    0  Hawks  2001    5
    1  Hawks  2004    4
    2   Nets  1987    3
    4   Nets  2001    8
    5   Nets  2000   10

Link: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html

    indicator : boolean or string, default False
        If True, adds a column to output DataFrame called "_merge" with information on the source of each row. Information column is Categorical-type and takes on a value of "left_only" for observations whose merge key only appears in 'left' DataFrame, "right_only" for observations whose merge key only appears in 'right' DataFrame, and "both" if the observation's merge key is found in both.
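
If you need this in more than one place, the indicator approach can be wrapped in a small helper. This is only a sketch: the function name rows_only_in_left is hypothetical, and it assumes that with on=None the two frames share exactly the columns you want to compare.

```python
import pandas as pd


def rows_only_in_left(left, right, on=None):
    """Return the rows of `left` that have no match in `right`.

    Hypothetical helper: with on=None, merge joins on all columns
    that the two frames have in common.
    """
    merged = left.merge(right, how="left", on=on, indicator=True)
    # '_merge' is 'left_only' for rows found only in `left`
    only_left = merged.loc[merged["_merge"] == "left_only"]
    return only_left.drop(columns="_merge")


df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

result = rows_only_in_left(df1, df2)
print(result)
```

If you pass on=['Team', 'Year'] instead, the shared foo column comes back as foo_x and foo_y, so you may want to select or rename columns afterwards.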