Pandas: discard duplicates in groupby 'date'

In the data frame below, I would like to remove duplicate cid values so that the output of df.groupby('date').cid.size() matches the output of df.groupby('date').cid.nunique().

I looked at this post, but it doesn't seem to have a solid solution to the problem.

 df = pd.read_csv('https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df')

 df.groupby('date').cid.size()
 date
 2005       7
 2006     237
 2007    3610
 2008    1318
 2009    2664
 2010     997
 2011    6390
 2012    2904
 2013    7875
 2014    3979

 df.groupby('date').cid.nunique()
 date
 2005      3
 2006     10
 2007    227
 2008     52
 2009    142
 2010     57
 2011    219
 2012     99
 2013    238
 2014    146
 Name: cid, dtype: int64

Things I tried:

  1. df.groupby([df['date']]).drop_duplicates(cols='cid') throws this error: AttributeError: Cannot access callable attribute 'drop_duplicates' of 'DataFrameGroupBy' objects, try using the 'apply' method
  2. df.groupby(('date').drop_duplicates('cid')) throws this error: AttributeError: 'str' object has no attribute 'drop_duplicates'
1 answer

You do not need groupby to remove duplicates based on multiple columns; you can pass a subset of columns to drop_duplicates instead:

 df2 = df.drop_duplicates(["date", "cid"])
 df2.groupby('date').cid.size()
 Out[99]:
 date
 2005      3
 2006     10
 2007    227
 2008     52
 2009    142
 2010     57
 2011    219
 2012     99
 2013    238
 2014    146
 dtype: int64
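To see why this works without needing the CSV from the question, here is a minimal sketch on a small synthetic frame (the data below is hypothetical, standing in for the linked file): after dropping rows that repeat a (date, cid) pair, the per-date size() of the deduplicated frame equals the per-date nunique() of the original.

```python
import pandas as pd

# Hypothetical stand-in for the CSV in the question:
# 2005 has cids {a, a, b} and 2006 has cids {c, c}
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006],
    "cid":  ["a", "a", "b", "c", "c"],
})

# Keep only the first row for each (date, cid) combination
df2 = df.drop_duplicates(subset=["date", "cid"])

# size() on the deduplicated frame now counts distinct cids per date
print(df2.groupby("date").cid.size())    # 2005 -> 2, 2006 -> 1

# ...which matches nunique() on the original frame
print(df.groupby("date").cid.nunique())  # 2005 -> 2, 2006 -> 1
```

Note that drop_duplicates keeps the first occurrence by default; pass keep='last' (or keep=False) if a different row should survive.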
