Python: speeding up index extraction for unique values

I have the following snippet that extracts the indices of all unique (hashable) values in a sequence-type data structure with canonical indexing and stores them in a dictionary as lists:

    from collections import defaultdict

    idx_lists = defaultdict(list)
    for idx, ele in enumerate(data):
        idx_lists[ele].append(idx)

This seems like a fairly common use case, and it happens that 90% of the execution time of my code is spent on these few lines. This part is executed more than 10,000 times at runtime, and len(data) is between 50,000 and 100,000 on each execution. The number of unique elements ranges from 50 to 150.
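For concreteness, here is a minimal example of the behaviour this snippet produces (the sample values are made up):

    from collections import defaultdict

    data = [3, 1, 3, 2, 1]                  # hypothetical sample sequence
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(data):
        idx_lists[ele].append(idx)

    # idx_lists is now {3: [0, 2], 1: [1, 4], 2: [3]}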

Is there a faster way, ideally a vectorized / C-extension one (e.g. numpy or pandas methods), that achieves the same result?

Thank you very much.

+6
3 answers

I found this question quite interesting. While I was not able to achieve a dramatic improvement over the other proposed methods, I did find a pure numpy method that was slightly faster than them.

    import numpy as np
    import pandas as pd
    from collections import defaultdict

    data = np.random.randint(0, 10**2, size=10**5)
    series = pd.Series(data)

    def get_values_and_indicies(input_data):
        input_data = np.asarray(input_data)
        sorted_indices = input_data.argsort()  # Get the sorted indices
        # Get the sorted data so we can see where the values change
        sorted_data = input_data[sorted_indices]
        # Find the locations where the values change and include the first and last values
        run_endpoints = np.concatenate(([0], np.where(sorted_data[1:] != sorted_data[:-1])[0] + 1, [len(input_data)]))
        # Get the unique values themselves
        unique_vals = sorted_data[run_endpoints[:-1]]
        num_values = len(unique_vals)
        # Return the unique values along with the indices associated with that value
        return {unique_vals[i]: sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()
                for i in range(num_values)}

    def by_dd(input_data):
        idx_lists = defaultdict(list)
        for idx, ele in enumerate(input_data):
            idx_lists[ele].append(idx)
        return idx_lists

    def by_pand1(input_data):
        return {k: v.tolist() for k, v in series.groupby(input_data).indices.items()}

    def by_pand2(input_data):
        return series.groupby(input_data).indices

    def data_to_idxlists(input_data):
        u, ixs = np.unique(input_data, return_inverse=True)
        return {u: np.nonzero(ixs == i) for i, u in enumerate(u)}

    def data_to_idxlists_unique(input_data):
        sorting_ixs = np.argsort(input_data)
        uniques, unique_indices = np.unique(input_data[sorting_ixs], return_index=True)
        return {u: sorting_ixs[start:stop]
                for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:]) + [None])}
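To illustrate the run-boundary trick in get_values_and_indicies, here is a tiny worked example (the input values are arbitrary, and the order of tied indices may vary with the sort algorithm):

    import numpy as np

    x = np.array([2, 0, 2, 1, 0])
    sorted_indices = x.argsort()        # e.g. [1, 4, 3, 0, 2]
    sorted_data = x[sorted_indices]     # [0, 0, 1, 2, 2]
    # Boundaries between runs of equal values, plus both ends:
    run_endpoints = np.concatenate(
        ([0], np.where(sorted_data[1:] != sorted_data[:-1])[0] + 1, [len(x)]))
    # run_endpoints == [0, 2, 3, 5], so:
    #   value 0 -> sorted_indices[0:2] == [1, 4]
    #   value 1 -> sorted_indices[2:3] == [3]
    #   value 2 -> sorted_indices[3:5] == [0, 2]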

Resulting timings (from the fastest to the slowest):

    >>> %timeit get_values_and_indicies(data)
    100 loops, best of 3: 4.25 ms per loop
    >>> %timeit by_pand2(series)
    100 loops, best of 3: 5.22 ms per loop
    >>> %timeit data_to_idxlists_unique(data)
    100 loops, best of 3: 6.23 ms per loop
    >>> %timeit by_pand1(series)
    100 loops, best of 3: 10.2 ms per loop
    >>> %timeit data_to_idxlists(data)
    100 loops, best of 3: 15.5 ms per loop
    >>> %timeit by_dd(data)
    10 loops, best of 3: 21.4 ms per loop

It should also be noted that, unlike by_pand2, get_values_and_indicies returns the dict of index lists shown in the question, rather than numpy arrays. If you would prefer to return a defaultdict, you can simply change the last line to return defaultdict(list, ((unique_vals[i], sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()) for i in range(num_values))), which increased the total time in my tests to 4.4 ms.
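As a quick sanity check of that construction, defaultdict accepts a default factory plus an iterable of key/value pairs (the pairs below are made up):

    from collections import defaultdict

    pairs = [(0, [1, 4]), (1, [3]), (2, [0, 2])]   # hypothetical (value, index-list) pairs
    d = defaultdict(list, pairs)
    print(d[0])    # [1, 4]
    print(d[99])   # []  (missing keys now default to an empty list)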

Finally, I should note that these timings are data dependent. When I used only 10 distinct values, I got:

  • get_values_and_indicies: 4.34 ms per loop
  • data_to_idxlists_unique: 4.42 ms per loop
  • by_pand2: 4.83 ms per loop
  • data_to_idxlists: 6.09 ms per loop
  • by_pand1: 9.39 ms per loop
  • by_dd: 22.4 ms per loop

and with 10,000 distinct values, I got:

  • get_values_and_indicies: 7.00 ms per loop
  • data_to_idxlists_unique: 14.8 ms per loop
  • by_dd: 29.8 ms per loop
  • by_pand2: 47.7 ms per loop
  • by_pand1: 67.3 ms per loop
  • data_to_idxlists: 869 ms per loop
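For reference, data sets like these can be produced the same way as data above, just by changing the range of possible values (the exact generation shown here is only illustrative):

    import numpy as np

    data_10_vals = np.random.randint(0, 10, size=10**5)       # ~10 distinct values
    data_10k_vals = np.random.randint(0, 10**4, size=10**5)   # ~10,000 distinct values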
+1

Not as impressive as I originally hoped (there is still a fair bit of pure Python in the groupby code path), but you can cut the time down by a factor of 2-4, depending on how much you care about the exact final types:

    import numpy as np, pandas as pd
    from collections import defaultdict

    def by_dd(data):
        idx_lists = defaultdict(list)
        for idx, ele in enumerate(data):
            idx_lists[ele].append(idx)
        return idx_lists

    def by_pand1(data):
        return {k: v.tolist() for k, v in data.groupby(data.values).indices.items()}

    def by_pand2(data):
        return data.groupby(data.values).indices

    data = pd.Series(np.random.randint(0, 100, size=10**5))

gives me

    >>> %timeit by_dd(data)
    10 loops, best of 3: 42.9 ms per loop
    >>> %timeit by_pand1(data)
    100 loops, best of 3: 18.2 ms per loop
    >>> %timeit by_pand2(data)
    100 loops, best of 3: 11.5 ms per loop
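The main difference between the two pandas versions is the output type: by_pand1 converts each index array to a plain Python list (matching the question's defaultdict-of-lists shape), while by_pand2 keeps the numpy arrays that groupby(...).indices produces. A quick check, reusing the names above:

    >>> grouped = by_pand2(data)
    >>> key = next(iter(grouped))        # any value that actually occurs in data
    >>> type(grouped[key])
    <class 'numpy.ndarray'>
    >>> type(by_pand1(data)[key])
    <class 'list'>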
+5

While this is not an ideal solution (it is O(N log N) instead of O(N)), a much faster, vectorized way to do it is:

    import numpy as np

    def data_to_idxlists(data):
        sorting_ixs = np.argsort(data)
        uniques, unique_indices = np.unique(data[sorting_ixs], return_index=True)
        return {u: sorting_ixs[start:stop]
                for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:]) + [None])}

Another solution, which is O(N*U) (where U is the number of unique groups):

    def data_to_idxlists(data):
        u, ixs = np.unique(data, return_inverse=True)
        return {u: np.nonzero(ixs == i) for i, u in enumerate(u)}
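One caveat with this second version: np.nonzero returns a tuple of arrays (one per dimension), so each dict value is a 1-tuple wrapping the index array. If you want flat lists like the question's snippet, unwrap with [0] and convert, e.g. (the variant name below is just for illustration):

    def data_to_idxlists_flat(data):
        # Same idea, but unwraps np.nonzero's 1-tuple and converts to plain lists
        u, ixs = np.unique(data, return_inverse=True)
        return {val: np.nonzero(ixs == i)[0].tolist() for i, val in enumerate(u)}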
+2
