What is the fastest way (within reason of Pythonicity) to count the number of distinct column values of the same dtype for each row in a DataFrame?
Details: I have a DataFrame of categorical results for each item (in rows) by day (in columns), similar to what is generated by the following.
import numpy as np
import pandas as pd

def genSampleData(custCount, dayCount, discreteChoices):
    """generate example dataset"""
    np.random.seed(123)
    return pd.concat([
        pd.DataFrame({'custId': np.array(range(1, int(custCount) + 1))}),
        pd.DataFrame(
            columns=np.array(['day%d' % x for x in range(1, int(dayCount) + 1)]),
            data=np.random.choice(a=np.array(discreteChoices),
                                  size=(int(custCount), int(dayCount)))
        )], axis=1)
For example, if the data set indicates which drink each customer orders each time they shop, I would like to know the number of different drinks for each customer.
# notional discrete choice outcome
drinkOptions, drinkIndex = np.unique(
    ['coffee', 'tea', 'juice', 'soda', 'water'], return_inverse=True)
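For reference, np.unique with return_inverse=True returns the sorted unique labels plus an integer code for each original entry, so the codes round-trip back to the labels:

```python
import numpy as np

drinkOptions, drinkIndex = np.unique(
    ['coffee', 'tea', 'juice', 'soda', 'water'], return_inverse=True)
print(drinkOptions)              # sorted unique labels
print(drinkIndex)                # integer code of each original entry
print(drinkOptions[drinkIndex])  # indexing by the codes recovers the original list
```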
What I tried: The datasets in this use case will have many more items than days (see the testDf example below), so I tried to find the most efficient row-wise solution:
testDf = genSampleData(100000,3, drinkIndex)
To improve on my initial apply-based attempt, note that pandas.DataFrame.apply() takes a raw argument:
If raw=True, the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.
This reduced execution time by more than half:
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1, raw=True)
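To see what raw=True actually changes, a tiny throwaway check (the DataFrame here is just illustrative): with the default raw=False each row is boxed into a pandas Series, while raw=True hands the function a bare ndarray, skipping that per-row overhead.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# default: each row arrives as a pandas Series (boxing overhead per row)
default_types = df.apply(lambda x: type(x).__name__, axis=1).tolist()
# raw=True: each row arrives as a bare numpy ndarray
raw_types = df.apply(lambda x: type(x).__name__, axis=1, raw=True).tolist()

print(default_types, raw_types)
```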
I was surprised that the pure NumPy solution, which would seem to be equivalent to the above with raw=True, was actually a bit slower:
%timeit -n20 np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr=testDf.iloc[:,1:].values)
Finally, I also tried transposing the data so that apply runs column-wise (the default for DataFrame.apply()), which I thought might be more efficient, but there was no significant difference:
%timeit -n20 testDf.iloc[:,1:].T.apply(lambda x: len(np.unique(x)), raw=True)
So far, my best solution is the odd combination of df.apply with len(np.unique()). What else should I try?
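One direction worth timing (a sketch, not benchmarked on your data; nunique_per_row is a name I made up): drop the per-row Python call entirely and use a single vectorized sort. In a sorted row, the number of distinct values is the number of adjacent changes plus one, so one np.sort plus np.diff replaces 100,000 calls to np.unique.

```python
import numpy as np

def nunique_per_row(a):
    """Count distinct values in each row of a 2-D array via one sort."""
    s = np.sort(a, axis=1)                            # sort within each row
    return (np.diff(s, axis=1) != 0).sum(axis=1) + 1  # adjacent changes + 1

a = np.array([[0, 1, 1],
              [2, 2, 2],
              [3, 0, 3]])
counts = nunique_per_row(a)
print(counts)  # → [2 1 2]
```

This would be called as nunique_per_row(testDf.iloc[:, 1:].values). For completeness, recent pandas also offers DataFrame.nunique(axis=1) as the idiomatic spelling, though in my understanding it still works row by row and may not beat the sort-based approach.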