Python pandas drop_duplicates: keep the second-to-last duplicate

Question


What is the most efficient way to select the second-to-last row of each set of duplicates in a pandas DataFrame?

For example, I basically want to perform this operation:

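 df = df.drop_duplicates(['Person','Question'], keep='last')

but keeping the second-to-last row instead, with something like this (a hypothetical option that pandas does not provide):

 df = df.drop_duplicates(['Person','Question'], keep='second_last')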

More generally: how do you choose which duplicate to keep when it is neither the first nor the last one in its group?

+7

python pandas

David yang Aug 15 '16 at 14:27


2 answers

You can use groupby/tail(2) to take the last two rows of each group, then groupby/head(1) to take the first of those two rows:

 df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

If a group contains only one row, tail(2) returns just that row, so the row is kept.
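To make the two steps concrete, here is a minimal sketch on a tiny frame. The data and the Answer column are invented for illustration; only the Person and Question column names come from the question.

```python
import pandas as pd

# Hypothetical sample data: three rows for ann/q1, two for bob/q1, one for cat/q2.
df = pd.DataFrame({
    'Person':   ['ann', 'ann', 'ann', 'bob', 'bob', 'cat'],
    'Question': ['q1',  'q1',  'q1',  'q1',  'q1',  'q2'],
    'Answer':   [10, 11, 12, 20, 21, 30],
})

keys = ['Person', 'Question']

# Step 1: keep at most the last two rows of each group (original order preserved).
last_two = df.groupby(keys).tail(2)

# Step 2: keep the first of those rows, i.e. the second-to-last row of the group,
# or the only row when the group has a single member.
second_last = last_two.groupby(keys).head(1)

print(second_last)
```

The result keeps the rows with index 1, 3 and 5: the second-to-last row for ann and bob, and the single row for cat.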

For example,

 import numpy as np
 import pandas as pd

 df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))

 result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

 expected = (df.groupby(['A', 'B'], as_index=False)
               .apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
               .reset_index(level=0, drop=True))

 assert expected.sort_index().equals(result)

Built-in groupby methods (such as tail and head) are often much faster than groupby/apply with a custom Python function. This is especially true when there are many groups:

 In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
 1000 loops, best of 3: 1.7 ms per loop

 In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
 100 loops, best of 3: 17.9 ms per loop

As an alternative, ayhan offers a nice improvement:

 alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
 assert expected.sort_index().equals(alt)

 In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
 1000 loops, best of 3: 1.43 ms per loop
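The same tail/drop_duplicates pattern appears to extend to the more general question of keeping the n-th row from the end of each group. The sketch below wraps it in a helper; the name keep_nth_from_last and its parameters are hypothetical and not part of pandas. It assumes that groups with fewer than n rows should fall back to their first row, which is what tail(n) followed by drop_duplicates does.

```python
import numpy as np
import pandas as pd

def keep_nth_from_last(df, subset, n=2):
    """Hypothetical helper: keep the n-th row from the end of each duplicate group.

    Groups with fewer than n rows keep their first row.
    """
    # tail(n) keeps at most the last n rows per group;
    # drop_duplicates (keep='first' by default) then keeps the earliest of them.
    return df.groupby(subset).tail(n).drop_duplicates(subset)

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'C': np.arange(10)})

second_last = keep_nth_from_last(df, ['A'], n=2)
third_last = keep_nth_from_last(df, ['A'], n=3)
last = keep_nth_from_last(df, ['A'], n=1)
print(second_last)
```

With n=1 this selects the same rows as drop_duplicates(subset, keep='last'), so the helper can be read as a generalization of that call.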

+2

unutbu Aug 16 '16 at 0:45


user2285236 · Accepted Answer · 2016-08-15T14:46:50+0000

Using groupby.apply:

 df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                    'B': np.arange(10),
                    'C': np.arange(10)})

 df
 Out:
    A  B  C
 0  1  0  0
 1  1  1  1
 2  1  2  2
 3  1  3  3
 4  2  4  4
 5  2  5  5
 6  2  6  6
 7  3  7  7
 8  3  8  8
 9  4  9  9

 (df.groupby('A', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
    .reset_index(level=0, drop=True))
 Out:
    A  B  C
 2  1  2  2
 5  2  5  5
 7  3  7  7
 9  4  9  9

With a different DataFrame, grouping on a subset of two columns:

 df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                    'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4],
                    'C': np.arange(10)})

 df
 Out:
    A  B  C
 0  1  1  0
 1  1  1  1
 2  1  2  2
 3  1  1  3
 4  2  2  4
 5  2  2  5
 6  2  2  6
 7  3  3  7
 8  3  3  8
 9  4  4  9

 (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
    .reset_index(level=0, drop=True))
 Out:
    A  B  C
 1  1  1  1
 2  1  2  2
 5  2  2  5
 7  3  3  7
 9  4  4  9
