How to count duplicate rows in a pandas DataFrame?

I am trying to count the duplicates of each distinct row in my DataFrame. For example, suppose I have the following DataFrame in pandas:

 df = pd.DataFrame({'one': pd.Series([1., 1, 1]), 'two': pd.Series([1., 2., 1])})

I get a df that looks like this:

    one  two
 0    1    1
 1    1    2
 2    1    1

I guess the first step is to find all the unique rows, which I do with:

 df.drop_duplicates() 

This gives me the following df:

    one  two
 0    1    1
 1    1    2

Now I want to take each row from the above df ([1 1] and [1 2]) and count how many times it occurs in the initial df. My result would look something like this:

 Row     Count
 [1 1]   2
 [1 2]   1

How do I take this last step?

Edit:

Here is a more detailed example:

 df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                    'two': pd.Series([True, False, False, True]),
                    'three': pd.Series([True, False, False, False])})

gives me:

      one  three    two
 0   True   True   True
 1   True  False  False
 2   True  False  False
 3  False  False   True

I need a result that tells me:

 Row                  Count
 [True True True]     1
 [True False False]   2
 [False False True]   1
5 answers

You can groupby on all the columns and call size; the index of the returned Series indicates the duplicate row values:

 In [28]: df.groupby(df.columns.tolist(), as_index=False).size()
 Out[28]:
 one    three  two
 False  False  True     1
 True   False  False    2
        True   True     1
 dtype: int64
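As an aside, newer pandas versions (1.1 and later) provide DataFrame.value_counts, which counts whole rows directly. A minimal sketch using the question's first example, assuming such a pandas version is available:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# value_counts over the whole frame returns a Series keyed by the
# unique row combinations, with the duplicate count as the value
counts = df.value_counts()
print(counts)
```

The result is indexed by the unique (one, two) combinations, so `counts[(1.0, 1.0)]` gives 2 here.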

This is what you really need:

 df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'})

    one  two  count
 0    1    1      2
 1    1    2      1
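The same idea can be written slightly more compactly by naming the new column directly in reset_index instead of renaming afterwards; a sketch with the question's first example:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# name='count' labels the size column in one step,
# replacing the rename(columns={0: 'count'}) call
result = df.groupby(df.columns.tolist()).size().reset_index(name='count')
print(result)
```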
 df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                    'two': pd.Series([1., 2., 1, 3]),
                    'three': pd.Series([1., 2., 1, 2])})
 df['str_list'] = df.apply(lambda row: ' '.join([str(int(val)) for val in row]), axis=1)
 df1 = pd.DataFrame(df['str_list'].value_counts().values,
                    index=df['str_list'].value_counts().index,
                    columns=['Count'])

It produces:

 >>> df1
        Count
 1 1 1      2
 3 2 3      1
 1 2 2      1

If the index values should be lists, you can take the above code one step further:

df1.index = df1.index.str.split()

It produces:

            Count
 [1, 1, 1]      2
 [3, 2, 3]      1
 [1, 2, 2]      1
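A lighter-weight variant of this approach converts each row to a tuple instead of a string, so no int/str round-trip is needed and the index keeps the original values. A sketch using the same four-row example:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                   'two': pd.Series([1., 2., 1, 3]),
                   'three': pd.Series([1., 2., 1, 2])})

# tuples are hashable, so value_counts can count whole rows directly
row_counts = df.apply(tuple, axis=1).value_counts()
print(row_counts)
```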

If you want to count duplicates in specific columns:

 len(df['one'])-len(df['one'].drop_duplicates()) 

If you want to count duplicates on the entire data frame:

 len(df)-len(df.drop_duplicates()) 

Or you can simply use DataFrame.duplicated(subset=None, keep='first'):

 df.duplicated(subset='one', keep='first').sum() 

Where

subset : column label or label sequence (all columns are used by default)

keep : {'first', 'last', False}, default 'first'

  • first: mark duplicates as True, except for the first occurrence.
  • last: mark duplicates as True, except for the last occurrence.
  • False: mark all duplicates as True.
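A small sketch of how the keep options differ, using the question's first example (rows 0 and 2 are identical):

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# keep='first' flags only the later copies of a duplicated row;
# keep=False flags every row that has a duplicate anywhere
n_extra = df.duplicated(keep='first').sum()
n_all = df.duplicated(keep=False).sum()
print(n_extra, n_all)
```

Here `n_extra` is 1 (only row 2 is flagged) while `n_all` is 2 (both copies are flagged).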

None of the existing answers offers a simple solution that returns the number of rows that are just duplicates and should be cut out. Here is a general-purpose solution:

 # generate a table of those culprit rows which are duplicated:
 dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'})

 # sum the final col of that table, and subtract the number of culprits:
 dups['count'].sum() - dups.shape[0]
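The same number, duplicate rows beyond their first occurrence, can also be obtained directly with duplicated, which makes a handy cross-check against the groupby formulation. A sketch using the question's boolean example:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})

# count rows flagged as duplicates (the first occurrence is not flagged)
to_cut = df.duplicated().sum()

# cross-check against the groupby formulation: total rows minus unique rows
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'})
assert to_cut == dups['count'].sum() - dups.shape[0]
print(to_cut)
```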
