How to count duplicate rows in a pandas DataFrame?

I am trying to count the duplicates of each distinct row in my DataFrame. For example, suppose I have the following DataFrame in pandas:

 df = pd.DataFrame({'one': pd.Series([1., 1, 1]), 'two': pd.Series([1., 2., 1])})

I get a df that looks like this:

    one  two
 0    1    1
 1    1    2
 2    1    1

I guess the first step is to find all the unique rows, which I do with:

 df.drop_duplicates() 

This gives me the following df:

    one  two
 0    1    1
 1    1    2

Now I want to take each row from the above df ([1 1] and [1 2]) and count how many times it occurs in the initial df. My result would look something like this:

 Row     Count
 [1 1]   2
 [1 2]   1

How do I take this last step?

Edit:

Here is a more detailed example:

 df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                    'two': pd.Series([True, False, False, True]),
                    'three': pd.Series([True, False, False, False])})

gives me:

      one  three    two
 0   True   True   True
 1   True  False  False
 2   True  False  False
 3  False  False   True

I need a result that tells me:

 Row                  Count
 [True True True]     1
 [True False False]   2
 [False False True]   1
5 answers

You can groupby on all the columns and call size; the index of the returned Series indicates the duplicate row values:

 In [28]: df.groupby(df.columns.tolist(), as_index=False).size()
 Out[28]:
 one    three  two
 False  False  True     1
 True   False  False    2
        True   True     1
 dtype: int64
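As an aside, newer pandas versions (1.1 and later) provide DataFrame.value_counts, which counts whole rows directly. A minimal sketch using the question's first example, assuming such a pandas version is available:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# value_counts over the whole frame returns a Series keyed by the
# unique row combinations, with the duplicate count as the value
counts = df.value_counts()
print(counts)
```

The result is indexed by the unique (one, two) combinations, so `counts[(1.0, 1.0)]` gives 2 here.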

This is what you really need:

 df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'})

    one  two  count
 0    1    1      2
 1    1    2      1
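The same idea can be written slightly more compactly by naming the new column directly in reset_index instead of renaming afterwards; a sketch with the question's first example:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# name='count' labels the size column in one step,
# replacing the rename(columns={0: 'count'}) call
result = df.groupby(df.columns.tolist()).size().reset_index(name='count')
print(result)
```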
 df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                    'two': pd.Series([1., 2., 1, 3]),
                    'three': pd.Series([1., 2., 1, 2])})
 df['str_list'] = df.apply(lambda row: ' '.join([str(int(val)) for val in row]), axis=1)
 df1 = pd.DataFrame(df['str_list'].value_counts().values,
                    index=df['str_list'].value_counts().index,
                    columns=['Count'])

It produces:

 >>> df1
        Count
 1 1 1      2
 3 2 3      1
 1 2 2      1

If the index values should be lists, you can take the above code one step further:

df1.index = df1.index.str.split()

It produces:

            Count
 [1, 1, 1]      2
 [3, 2, 3]      1
 [1, 2, 2]      1
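A lighter-weight variant of this approach converts each row to a tuple instead of a string, so no int/str round-trip is needed and the index keeps the original values. A sketch using the same four-row example:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                   'two': pd.Series([1., 2., 1, 3]),
                   'three': pd.Series([1., 2., 1, 2])})

# tuples are hashable, so value_counts can count whole rows directly
row_counts = df.apply(tuple, axis=1).value_counts()
print(row_counts)
```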

If you want to count duplicates in specific columns:

 len(df['one'])-len(df['one'].drop_duplicates()) 

If you want to count duplicates on the entire data frame:

 len(df)-len(df.drop_duplicates()) 

Or you can simply use DataFrame.duplicated(subset=None, keep='first'):

 df.duplicated(subset='one', keep='first').sum() 

Where

subset : column label or label sequence (all columns are used by default)

keep : {'first', 'last', False}, default 'first'

  • first: mark duplicates as True, except for the first occurrence.
  • last: mark duplicates as True, except for the last occurrence.
  • False: mark all duplicates as True.
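A small sketch of how the keep options differ, using the question's first example (rows 0 and 2 are identical):

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# keep='first' flags only the later copies of a duplicated row;
# keep=False flags every row that has a duplicate anywhere
n_extra = df.duplicated(keep='first').sum()
n_all = df.duplicated(keep=False).sum()
print(n_extra, n_all)
```

Here `n_extra` is 1 (only row 2 is flagged) while `n_all` is 2 (both copies are flagged).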

None of the existing answers offers a simple solution that returns the number of rows that are just duplicates and should be cut out. Here is a general-purpose solution:

 # generate a table of those culprit rows which are duplicated:
 dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'})

 # sum the final col of that table, and subtract the number of culprits:
 dups['count'].sum() - dups.shape[0]
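The same number, duplicate rows beyond their first occurrence, can also be obtained directly with duplicated, which makes a handy cross-check against the groupby formulation. A sketch using the question's boolean example:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})

# count rows flagged as duplicates (the first occurrence is not flagged)
to_cut = df.duplicated().sum()

# cross-check against the groupby formulation: total rows minus unique rows
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'})
assert to_cut == dups['count'].sum() - dups.shape[0]
print(to_cut)
```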
