How to make SQL style disjoint or set difference on two Pandas DataFrame objects?

I am trying to use Pandas to solve the problem with the idiotic DBA without backing up the now broken dataset, so I am trying to find the differences between the two columns. For reasons that I wonโ€™t fall into, I use Pandas rather than a database.

What I would like to do is:

Dataset A = [A, B, C, D, E] Dataset B = [C, D, E, F] 

I would like to find values โ€‹โ€‹that do not overlap.

 Dataset A!=B = [A, B, F] 

In SQL, this is the standard set logic, performed differently depending on the dialect, but a standard function. How can I elegantly apply this in Pandas? I would like to introduce some code, but nothing that I have is even remotely correct. This is a situation in which I donโ€™t know what I donโ€™t know ..... Pandas set the logic for intersection and union, but nothing for the disjunct / installation difference.

Thanks!

+7
source share
2 answers

You can use the set.symmetric_difference function:

 In [1]: df1 = DataFrame(list('ABCDE'), columns=['x']) In [2]: df1 Out[2]: x 0 A 1 B 2 C 3 D 4 E In [3]: df2 = DataFrame(list('CDEF'), columns=['y']) In [4]: df2 Out[4]: y 0 C 1 D 2 E 3 F In [5]: set(df1.x).symmetric_difference(df2.y) Out[5]: set(['A', 'B', 'F']) 
+8
source

Here's a solution for multiple columns, maybe not very efficient, I would like to get some feedback on this faster:

 input = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': ['a', 'a', 'b', 'a', 'c']}) limit = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']}) def set_difference(input_set, limit_on_set): limit_on_set_sub = limit_on_set[['A', 'B']] limit_on_tuples = [tuple(x) for x in limit_on_set_sub.values] limit_on_dict = dict.fromkeys(limit_on_tuples, 1) entries_in_limit = input_set.apply(lambda row: (row['A'], row['B']) in limit_on_dict, axis=1) return input_set[~entries_in_limit] >>> set_difference(input, limit) item user 1 a 2 3 a 3 
0
source

All Articles