Counting unique elements in a numpy array: why is scipy.stats.itemfreq so slow?

I am trying to count unique values ​​in a numpy array.

import numpy as np from collections import defaultdict import scipy.stats import time x = np.tile([1,2,3,4,5,6,7,8,9,10],20000) for i in [44,22,300,403,777,1009,800]: x[i] = 11 def getCounts(x): counts = defaultdict(int) for item in x: counts[item] += 1 return counts flist = [getCounts, scipy.stats.itemfreq] for f in flist: print f t1 = time.time() y = f(x) t2 = time.time() print y print '%.5f sec' % (t2-t1) 

I could not find the built-in function first, so I wrote getCounts() ; then I found scipy.stats.itemfreq , so I thought that I would use this instead. But it is slow! This is what I get on my PC. Why is it so slow compared to such a simple handwritten function?

 <function getCounts at 0x0000000013C78438> defaultdict(<type 'int'>, {1: 19998, 2: 20000, 3: 19999, 4: 19999, 5: 19999, 6: 20000, 7: 20000, 8: 19999, 9: 20000, 10: 19999, 11: 7}) 0.04700 sec <function itemfreq at 0x0000000013C5D208> [[ 1.00000000e+00 1.99980000e+04] [ 2.00000000e+00 2.00000000e+04] [ 3.00000000e+00 1.99990000e+04] [ 4.00000000e+00 1.99990000e+04] [ 5.00000000e+00 1.99990000e+04] [ 6.00000000e+00 2.00000000e+04] [ 7.00000000e+00 2.00000000e+04] [ 8.00000000e+00 1.99990000e+04] [ 9.00000000e+00 2.00000000e+04] [ 1.00000000e+01 1.99990000e+04] [ 1.10000000e+01 7.00000000e+00]] 2.04100 sec 
+7
python numpy scipy
source share
3 answers

If you can use numpy 1.9, you can use numpy.unique with argument return_counts=True . I.e.

 unique_items, counts = np.unique(x, return_counts=True) 

In fact, itemfreq been updated to use np.unique , but scipy currently supports numpy versions up to 1.5, so it does not use the return_counts argument.

Here's the full implementation of itemfreq in scipy 0.14:

 def itemfreq(a): items, inv = np.unique(a, return_inverse=True) freq = np.bincount(inv) return np.array([items, freq]).T 
+14
source share

Firstly, time.time is the wrong function used during synchronization, as it measures the time of the wall clock, not the processor time (see this question ). Ideally, you should use the timeit module, but time.clock also better.

In addition, it seems that you can use an outdated version of scipy. I am using Python 3.4 and scipy 0.14.0 and these are my timings:

 x = np.tile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 20000) for i in [44, 22, 300, 403, 777, 1009, 800]: x[i] = 11 %timeit getCounts(x) # 10 loops, best of 3: 55.6 ms per loop %timeit scipy.stats.itemfreq(x) # 10 loops, best of 3: 20.8 ms per loop %timeit collections.Counter(x) # 10 loops, best of 3: 39.9 ms per loop %timeit np.unique(x, return_counts=True) # 100 loops, best of 3: 4.13 ms per loop 
+3
source share

Thanks for answers. I cannot use numpy 1.9 yet or scipy 0.14 due to some module conflicts in my application, but it looks like the new scipy.stats.itemfreq is much faster:

 import numpy as np from collections import defaultdict, Counter import scipy.stats import time import timeit x = np.tile([1,2,3,4,5,6,7,8,9,10],20000) for i in [44,22,300,403,777,1009,800]: x[i] = 11 def getCounts(x): counts = defaultdict(int) for item in x: counts[item] += 1 return counts def itemfreq_scipy14(x): '''this is how itemfreq works in 0.14: https://github.com/scipy/scipy/commit/7e04d6630f229693cca3522b62aa16226f174053 ''' items, inv = np.unique(x, return_inverse=True) freq = np.bincount(inv) return np.array([items, freq]).T flist = [getCounts, scipy.stats.itemfreq, np.bincount, itemfreq_scipy14, Counter] for f in flist: print f print timeit.timeit(lambda: f(x),number=3) 

which is displayed on my pc:

 <function getCounts at 0x0000000013F8EB38> 0.148138969181 <function itemfreq at 0x0000000013C5D208> 6.15385023664 <built-in function bincount> 0.00313706656675 <function itemfreq_scipy14 at 0x0000000013F8EDD8> 0.0757223407165 <class 'collections.Counter'> 0.255281199559 
0
source share

All Articles