Efficiently counting distinct values across DataFrame columns, per row

What is the fastest way (within reason of Pythonic-ness) to count the distinct values of the same dtype across the columns of each row of a DataFrame?

Details: I have a DataFrame of categorical results for each item (in rows) by day (in columns), similar to what is generated by the following.

    import numpy as np
    import pandas as pd

    def genSampleData(custCount, dayCount, discreteChoices):
        """generate example dataset"""
        np.random.seed(123)
        return pd.concat([
                   pd.DataFrame({'custId': np.array(range(1, int(custCount)+1))}),
                   pd.DataFrame(
                       columns=np.array(['day%d' % x for x in range(1, int(dayCount)+1)]),
                       data=np.random.choice(a=np.array(discreteChoices),
                                             size=(int(custCount), int(dayCount))))
               ], axis=1)

For example, if the data set indicates which drink each customer orders each time they shop, I would like to know the number of different drinks for each customer.

    # notional discrete choice outcome
    drinkOptions, drinkIndex = np.unique(['coffee','tea','juice','soda','water'],
                                         return_inverse=True)  # integer-coded discrete choice outcomes
    d = genSampleData(2, 3, drinkIndex)
    d
    #   custId  day1  day2  day3
    #0       1     1     4     1
    #1       2     3     2     1

    # Count distinct choices per subject -- this is what I want to do efficiently on a larger DF
    d.iloc[:, 1:].apply(lambda x: len(np.unique(x)), axis=1)
    #0    2
    #1    3

    # Note: I have coded the choices as `int` rather than `str` to speed up comparisons.
    # To reconstruct the choice names, we could do:
    # d.iloc[:, 1:] = drinkOptions[d.iloc[:, 1:]]

What I tried: The datasets in this use case will have many more items than days (see testDf below), so I tried to find the most efficient row-wise operation:

    testDf = genSampleData(100000, 3, drinkIndex)

    #---- Original attempts ----
    %timeit -n20 testDf.iloc[:,1:].apply(lambda x: x.nunique(), axis=1)
    # I didn't wait for this to finish -- something more than 5 seconds per loop
    %timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(x.unique()), axis=1)
    # Also too slow
    %timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1)
    #20 loops, best of 3: 2.07 s per loop

To improve on my initial attempt, I noted that pandas.DataFrame.apply() accepts the argument raw; the docs say:

If raw=True, the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function, this will achieve much better performance.

This reduced execution time by more than half:

    %timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1, raw=True)
    #20 loops, best of 3: 721 ms per loop  *best so far*

I was surprised that the pure numpy solution, which would seem to be equivalent to the above with raw=True, was actually a bit slower:

    %timeit -n20 np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr=testDf.iloc[:,1:].values)
    #20 loops, best of 3: 1.04 s per loop

Finally, I also tried transposing the data to count down columns instead, which I thought might be faster (at least for DataFrame.apply()), but there was no meaningful difference.

    %timeit -n20 testDf.iloc[:,1:].T.apply(lambda x: len(np.unique(x)), raw=True)
    #20 loops, best of 3: 712 ms per loop  *best so far*

    %timeit -n20 np.apply_along_axis(lambda x: len(np.unique(x)), axis=0, arr=testDf.iloc[:,1:].values.T)
    #20 loops, best of 3: 1.13 s per loop

My best solution so far is this odd combination of df.apply with len(np.unique()), but what else should I try?

3 answers

My understanding is that nunique is optimized for large Series. Here you have only 3 days. Comparing each column against the others looks faster:

    testDf = genSampleData(100000, 3, drinkIndex)
    days = testDf.columns[1:]

    %timeit testDf.iloc[:, 1:].stack().groupby(level=0).nunique()
    10 loops, best of 3: 46.8 ms per loop

    %timeit pd.melt(testDf, id_vars='custId').groupby('custId').value.nunique()
    10 loops, best of 3: 47.6 ms per loop

    %%timeit
    # start each row at 1 distinct value, then add 1 for every day whose
    # value does not appear in any of the earlier day columns
    testDf['nunique'] = 1
    for col1, col2 in zip(days, days[1:]):
        testDf['nunique'] += ~((testDf[[col2]].values == testDf.loc[:, 'day1':col1].values)).any(axis=1)
    100 loops, best of 3: 3.83 ms per loop

Of course, the loop loses its edge as you add more columns. For different numbers of columns (same order: stack().groupby(), pd.melt().groupby(), and the loop):

    10 columns:  143ms, 161ms, 30.9ms
    50 columns:  749ms, 968ms, 635ms
    100 columns: 1.52s, 2.11s, 2.33s
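
For anyone who wants to reproduce this kind of comparison, a rough sketch (not from the original answer; it assumes genSampleData and drinkIndex from the question are in scope and uses time.perf_counter instead of %timeit) could look like:

    import time
    import pandas as pd

    def bench(fn, reps=5):
        """Best wall-clock time over `reps` runs."""
        best = float('inf')
        for _ in range(reps):
            t0 = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - t0)
        return best

    for dayCount in (10, 50, 100):
        df = genSampleData(100000, dayCount, drinkIndex)  # helper from the question
        days = df.columns[1:]

        def via_stack():
            return df.iloc[:, 1:].stack().groupby(level=0).nunique()

        def via_melt():
            return pd.melt(df, id_vars='custId').groupby('custId').value.nunique()

        def via_loop():
            # running column-by-column comparison, as in the answer above
            out = pd.Series(1, index=df.index)
            for col1, col2 in zip(days, days[1:]):
                out += ~(df[[col2]].values == df.loc[:, days[0]:col1].values).any(axis=1)
            return out

        print(dayCount, bench(via_stack), bench(via_melt), bench(via_loop))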

pandas.melt with DataFrame.groupby and SeriesGroupBy.nunique seems to blow the other solutions away:

    %timeit -n20 pd.melt(testDf, id_vars='custId').groupby('custId').value.nunique()
    #20 loops, best of 3: 67.3 ms per loop
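
If you also want the per-customer count attached back onto the original frame, one possible follow-up (a sketch added here, not part of the answer; distinct_drinks is just a made-up column name) is to map the resulting Series, which is indexed by custId, onto the custId column:

    import pandas as pd

    # distinct values per customer, indexed by custId
    per_cust = pd.melt(testDf, id_vars='custId').groupby('custId').value.nunique()

    # hypothetical new column, aligned on custId via map
    testDf['distinct_drinks'] = testDf['custId'].map(per_cust)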

You don't need custId. I'd stack, then groupby:

 testDf.iloc[:, 1:].stack().groupby(level=0).nunique() 

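On the small d from the question, a quick illustrative check (added here, not part of the original answer) shows this matches the apply result:

    d.iloc[:, 1:].stack().groupby(level=0).nunique()
    #0    2
    #1    3
    #dtype: int64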

