Averaging duplicate values from two paired lists in Python using NumPy

In the past I came across the question of averaging values from two paired lists, and I successfully used the answers provided there.

However, with large lists (over 20,000 elements) the procedure is somewhat slow, and I was wondering whether it would be faster to use NumPy.

I start with two lists, one of strings (the names) and one of floats (the values):

names = ["a", "b", "b", "c", "d", "e", "e"] values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01] 

I am trying to calculate the average of the values that share the same name, so that I get:

    result_names = ["a", "b", "c", "d", "e"]
    result_values = [1.2, 4.4, 2.0, 5.67, 8.54]

I give two lists as an example, but having a list of tuples (name, value) would also be fine:

    result = [("a", 1.2), ("b", 4.4), ("c", 2.0), ("d", 5.67), ("e", 8.54)]

What is the best way to do this with NumPy?

4 answers

With numpy you can write something yourself, or you can use groupby functionality (the rec_groupby function from matplotlib.mlab, which is however much slower; for more powerful groupby functionality have a look at pandas, a short sketch of which follows after the timings below). I compared both with Michael Dunn's dictionary-based answer:

    import numpy as np
    import random
    from matplotlib.mlab import rec_groupby

    listA = [random.choice("abcdef") for i in range(20000)]
    listB = [20 * random.random() for i in range(20000)]

    names = np.array(listA)
    values = np.array(listB)

    def f_dict(listA, listB):
        # collect the values per name in a dict, then average each list
        d = {}
        for a, b in zip(listA, listB):
            d.setdefault(a, []).append(b)
        avg = []
        for key in d:
            avg.append(sum(d[key]) / len(d[key]))
        return d.keys(), avg

    def f_numpy(names, values):
        # for every unique name, take the mean of the values selected by a boolean mask
        result_names = np.unique(names)
        result_values = np.empty(result_names.shape)
        for i, name in enumerate(result_names):
            result_values[i] = np.mean(values[names == name])
        return result_names, result_values
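
The struct_array used with rec_groupby in the results below is not constructed in the snippet above; presumably it was a record array built from the same data, along these lines (an assumption, not shown in the original answer):

    # Assumption: struct_array is a record array with 'names' and 'values' fields,
    # built from the same names/values arrays defined above.
    struct_array = np.rec.fromarrays([names, values], names="names,values")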

This gives, for the three approaches:

    In [2]: f_dict(listA, listB)
    Out[2]:
    (['a', 'c', 'b', 'e', 'd', 'f'],
     [9.9003182717213765,
      10.077784850173568,
      9.8623915728699636,
      9.9790599744319319,
      9.8811096512807097,
      10.118695410115953])

    In [3]: f_numpy(names, values)
    Out[3]:
    (array(['a', 'b', 'c', 'd', 'e', 'f'], dtype='|S1'),
     array([  9.90031827,   9.86239157,  10.07778485,   9.88110965,
              9.97905997,  10.11869541]))

    In [7]: rec_groupby(struct_array, ('names',), (('values', np.mean, 'resvalues'),))
    Out[7]:
    rec.array([('a', 9.900318271721376), ('b', 9.862391572869964),
               ('c', 10.077784850173568), ('d', 9.88110965128071),
               ('e', 9.979059974431932), ('f', 10.118695410115953)],
              dtype=[('names', '|S1'), ('resvalues', '<f8')])

And it seems that numpy is somewhat faster for this test (while the pre-packaged rec_groupby function is much slower):

    In [32]: %timeit f_dict(listA, listB)
    10 loops, best of 3: 23 ms per loop

    In [33]: %timeit f_numpy(names, values)
    100 loops, best of 3: 9.78 ms per loop

    In [8]: %timeit rec_groupby(struct_array, ('names',), (('values', np.mean, 'values'),))
    1 loops, best of 3: 203 ms per loop
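
As for the pandas suggestion above, a minimal sketch of the same grouping there might look like this (assuming pandas is installed; this was not part of the timed comparison):

    import random
    import pandas as pd

    listA = [random.choice("abcdef") for i in range(20000)]
    listB = [20 * random.random() for i in range(20000)]

    # group the values by name and take the per-name mean
    df = pd.DataFrame({"names": listA, "values": listB})
    print(df.groupby("names")["values"].mean())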

Perhaps the numpy solution is more complicated than you need. Without doing anything fancy, I found the following to be "fast-like" (as in, there was no noticeable wait with 20,000 entries in the lists):

    import random

    listA = [random.choice("abcdef") for i in range(20000)]
    listB = [20 * random.random() for i in range(20000)]

    # collect the values per name, then print each name with its average
    d = {}
    for a, b in zip(listA, listB):
        d.setdefault(a, []).append(b)

    for key in d:
        print(key, sum(d[key]) / len(d[key]))
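
If, as in the question, the goal is a list of (name, average) tuples rather than printed output, a small follow-up on the same dictionary d could look like this (a sketch, not part of the original answer):

    # Assumption: build the (name, average) tuples asked for in the question
    result = [(key, sum(vals) / len(vals)) for key, vals in d.items()]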

Your mileage may vary, depending on whether 20,000 is a typical length for your lists and on whether you do this just a couple of times in a script or hundreds/thousands of times.


A bit late to the party, but seeing that numpy still lacks this functionality, here is my best attempt at a pure-numpy solution to achieve grouping by key. It should be much faster than the other proposed solutions for problem sets of significant size. The key here is the excellent reduceat functionality.

    import numpy as np

    def group(key, value):
        """
        Group the values by key.

        Returns the unique keys, their corresponding per-key sum, and the key counts.
        """
        # upcast to numpy arrays
        key = np.asarray(key)
        value = np.asarray(value)
        # first, sort by key
        I = np.argsort(key)
        key = key[I]
        value = value[I]
        # the slicing points of the bins to sum over
        slices = np.concatenate(([0], np.where(key[:-1] != key[1:])[0] + 1))
        # first entry of each bin is a unique key
        unique_keys = key[slices]
        # sum over the slices specified by index
        per_key_sum = np.add.reduceat(value, slices)
        # number of counts per key is the difference of our slice points;
        # cap off with the total number of keys for the last bin
        key_count = np.diff(np.append(slices, len(key)))
        return unique_keys, per_key_sum, key_count

    names = ["a", "b", "b", "c", "d", "e", "e"]
    values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

    unique_keys, per_key_sum, key_count = group(names, values)
    print(per_key_sum / key_count)
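
With the data from the question, the final print should show approximately [1.2, 4.4, 2.0, 5.67, 8.545]. If the (name, value) tuples from the question are wanted instead, a small follow-up could be (a sketch, not part of the original answer):

    # Assumption: pair each unique key with its per-key mean, as plain Python types
    result = [(str(k), float(s / n)) for k, s, n in zip(unique_keys, per_key_sum, key_count)]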

A simple solution via numpy, assuming vA0 and vB0 are numpy arrays, with the values in vB0 indexed by the keys in vA0:

    import numpy as np

    def avg_group(vA0, vB0):
        # unique keys, the index of their first occurrence, and how often each key occurs
        vA, ind, counts = np.unique(vA0, return_index=True, return_counts=True)
        # start from the value at each key's first occurrence
        vB = vB0[ind]
        for dup in vA[counts > 1]:
            # replace the entry for each duplicated key by the average
            # (one may change the reduction as wished) of its original elements in vB0
            vB[np.where(vA == dup)] = np.average(vB0[np.where(vA0 == dup)])
        return vA, vB
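
As a quick sanity check (a sketch, not part of the original answer), the function can be called with the arrays from the question:

    # Assumption: the question's data, converted to numpy arrays as the answer requires
    names = np.array(["a", "b", "b", "c", "d", "e", "e"])
    values = np.array([1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01])

    result_names, result_values = avg_group(names, values)
    print(result_names)   # ['a' 'b' 'c' 'd' 'e']
    print(result_values)  # [1.2   4.4   2.    5.67  8.545]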
