Get the number of unique rows in a pandas dataframe

Question

I have a Pandas DataFrame -

    >>> import numpy as np
    >>> import pandas as pd
    >>> data = pd.DataFrame(np.random.randint(low=0, high=2, size=(5,3)),
    ...                     columns=['A', 'B', 'C'])
    >>> data
       A  B  C
    0  0  1  0
    1  1  0  1
    2  1  0  1
    3  0  1  1
    4  1  1  0

Now I use this to get the row count for column A only

    >>> data.loc[:, 'A'].value_counts()
    1    3
    0    2
    dtype: int64

What is the most efficient way to get the row count for columns A and B, for example the following output:

    0  0    0
    0  1    2
    1  0    2
    1  1    1

And finally, how can I convert it to a numpy array such as -

 array([[0, 2], [2, 1]])

Please give a solution that is also consistent with

    >>> data = pd.DataFrame(np.random.randint(low=0, high=2, size=(5,2)),
    ...                     columns=['A', 'B'])

+6

python numpy pandas

Yashu seth Dec 13 '15 at 20:20

3 answers

You can use groupby on columns A and B and then count. However, this only returns the combinations that actually occur in the original frame, so in your case there will be no entry for the (0, 0) combination. After that, you can use the values attribute to get a numpy array:

    In [52]: df
    Out[52]:
       A  B  C
    0  0  1  0
    1  1  0  1
    2  1  0  1
    3  0  1  1
    4  1  1  0

    In [56]: df.groupby(['A', 'B'], as_index=False).count()
    Out[56]:
       A  B  C
    0  0  1  2
    1  1  0  2
    2  1  1  1

    In [57]: df.groupby(['A', 'B'], as_index=False).count().C.values
    Out[57]: array([2, 2, 1])

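Then you can use the numpy array's reshape method. This only works when every combination of A and B is present, so that the number of counts matches the target shape.

For a data frame that contains every combination of values: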

    In [71]: df
    Out[71]:
       A  B  C
    0  1  0  1
    1  1  1  1
    2  1  0  1
    3  1  1  0
    4  0  1  1
    5  0  0  1
    6  1  1  1
    7  0  0  1
    8  0  1  0
    9  1  1  0

    In [73]: df.groupby(['A', 'B'], as_index=False).count()
    Out[73]:
       A  B  C
    0  0  0  2
    1  0  1  2
    2  1  0  2
    3  1  1  4

    In [75]: df.groupby(['A', 'B'], as_index=False).count().C.values.reshape(2,2)
    Out[75]:
    array([[2, 2],
           [2, 4]])
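When a combination is missing, as with (0, 0) in the question's frame, the reshape above fails because there are only three counts. One possible workaround, sketched below, is to reindex the counts against every possible (A, B) pair before reshaping. The value set [0, 1] and the names full_index and result are assumptions made for this illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# The frame from the question, where the (A=0, B=0) combination never occurs.
data = pd.DataFrame({"A": [0, 1, 1, 0, 1],
                     "B": [1, 0, 0, 1, 1],
                     "C": [0, 1, 1, 1, 0]})

# Count rows per (A, B) pair; only observed pairs appear in the result.
counts = data.groupby(["A", "B"]).size()

# Build every possible (A, B) pair and fill the unobserved ones with 0.
# Assumption: both columns are binary, so the possible values are [0, 1].
full_index = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=["A", "B"])
counts = counts.reindex(full_index, fill_value=0)

# There are now always four entries, so the reshape is safe.
result = counts.to_numpy().reshape(2, 2)
print(result)
```

Because size() counts rows rather than values in another column, this sketch does not depend on column C being present.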

0

Anton Protopopov Dec 13 '15 at 20:34

Assuming all your data is binary, you can just sum the columns to get the number of ones in each. To be safe, you then use count to get the total number of non-null values in each column (the difference between this count and the previous sum is the number of zeros).

    >>> s = data[['A', 'B']].sum().values
    >>> np.matrix([s, data[['A', 'B']].count().values - s])
    matrix([[3, 3],
            [2, 2]])

If you are sure that there are no null values, you can save some computation time by simply taking the number of rows from the first element of shape.

    >>> np.matrix([s, data.shape[0] - s])
    matrix([[3, 3],
            [2, 2]])

0

Alexander Dec 13 '15 at 21:09

Andy hayden · Accepted Answer · 2015-12-13T21:20:48+0000

You can use groupby size and then unstack:

    In [11]: data.groupby(["A","B"]).size()
    Out[11]:
    A  B
    0  1    2
    1  0    2
       1    1
    dtype: int64

    In [12]: data.groupby(["A","B"]).size().unstack("B")
    Out[12]:
    B   0  1
    A
    0 NaN  2
    1   2  1

    In [13]: data.groupby(["A","B"]).size().unstack("B").fillna(0)
    Out[13]:
    B  0  1
    A
    0  0  2
    1  2  1

However, whenever you do a groupby followed by an unstack, you should think of pivot_table:

    In [21]: data.pivot_table(index="A", columns="B", aggfunc="count", fill_value=0)
    Out[21]:
       C
    B  0  1
    A
    0  0  2
    1  2  1

This will be the most efficient solution, as well as the most direct.
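The question also asks for a numpy array and for a solution that works when the frame has only columns A and B. A minimal sketch of that last step follows, using the question's A and B values. It uses groupby size rather than pivot_table with aggfunc="count", since the latter counts values in a remaining column such as C, which the two-column frame does not have. pd.crosstab is shown as an alternative that is not part of the answer above, and the names table, arr and alt are illustrative.

```python
import numpy as np
import pandas as pd

# The two-column variant of the question's frame (no column C).
data = pd.DataFrame({"A": [0, 1, 1, 0, 1],
                     "B": [1, 0, 0, 1, 1]})

# Count rows per (A, B) pair, then pivot B into columns.
# fill_value=0 replaces the NaN for the unobserved (0, 0) pair,
# so no separate fillna step is needed.
table = data.groupby(["A", "B"]).size().unstack("B", fill_value=0)

# Convert the 2x2 table of counts to a plain numpy array.
arr = table.to_numpy()
print(arr)

# Alternative: crosstab builds the same frequency table directly.
alt = pd.crosstab(data["A"], data["B"]).to_numpy()
```

Both approaches only create rows and columns for values that occur at least once, so if, say, B were never 0, the result would have a single column instead of two.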
