I found this question quite interesting. Although I have not been able to achieve a significant improvement over the other proposed methods, I did find a pure NumPy method that was slightly faster than them.
    import numpy as np
    import pandas as pd
    from collections import defaultdict

    data = np.random.randint(0, 10**2, size=10**5)
    series = pd.Series(data)

    def get_values_and_indicies(input_data):
        input_data = np.asarray(input_data)
        sorted_indices = input_data.argsort()
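The listing above stops after the argsort call. As a rough sketch only, the rest of the approach could look like the following. The names `unique_vals`, `run_endpoints` and `num_values` are taken from the return statement quoted later in this answer; everything else (the function name `get_values_and_indices_sketch`, the stable sort, the empty-input guard, the conversion of keys to plain Python scalars) is my own assumption and not necessarily what was timed.

```python
import numpy as np


def get_values_and_indices_sketch(input_data):
    """Map each distinct value to the list of positions where it occurs."""
    input_data = np.asarray(input_data)
    if input_data.size == 0:
        # Assumption: an empty input simply yields an empty mapping.
        return {}
    # A stable sort keeps the positions inside each group in ascending order.
    sorted_indices = input_data.argsort(kind="stable")
    sorted_data = input_data[sorted_indices]
    # Offsets in the sorted array where a new value starts.
    change_points = np.flatnonzero(sorted_data[1:] != sorted_data[:-1]) + 1
    # Add both ends so that consecutive endpoints delimit one run each.
    run_endpoints = np.concatenate(([0], change_points, [len(sorted_data)]))
    # The first element of every run is the value that run represents.
    unique_vals = sorted_data[run_endpoints[:-1]]
    num_values = len(unique_vals)
    return {
        # .item() turns the NumPy scalar into a plain Python scalar key.
        unique_vals[i].item(): sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()
        for i in range(num_values)
    }
```

The idea is that after sorting, equal values sit next to each other, so each group of indices is a contiguous slice of `sorted_indices` and no Python-level loop over the individual elements is needed.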
Resulting timings (from the fastest to the slowest):
    >>> %timeit get_values_and_indicies(data)
    100 loops, best of 3: 4.25 ms per loop
    >>> %timeit by_pand2(series)
    100 loops, best of 3: 5.22 ms per loop
    >>> %timeit data_to_idxlists_unique(data)
    100 loops, best of 3: 6.23 ms per loop
    >>> %timeit by_pand1(series)
    100 loops, best of 3: 10.2 ms per loop
    >>> %timeit data_to_idxlists(data)
    100 loops, best of 3: 15.5 ms per loop
    >>> %timeit by_dd(data)
    10 loops, best of 3: 21.4 ms per loop
Note that, unlike by_pand2, it produces the lists of indices in the format given in the example. If you would prefer to return a defaultdict, you can simply change the last line to

    return defaultdict(list, ((unique_vals[i], sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()) for i in range(num_values)))

which increased the overall time in my tests to 4.4 ms.
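To illustrate what switching to a defaultdict changes for the caller, here is a small sketch. The dictionary `plain` is a made-up stand-in for the result of a grouping function, not output from the timed code.

```python
from collections import defaultdict

# Hypothetical result: value -> positions where that value occurs.
plain = {7: [0, 3], 9: [1, 2]}

# Same contents, but a missing key now falls back to an empty list.
with_default = defaultdict(list, plain.items())

# Existing keys behave identically in both mappings.
positions_of_seven = with_default[7]

# Looking up an absent value returns [] instead of raising KeyError.
# Note that this lookup also inserts the key 42 into with_default.
positions_of_missing = with_default[42]

# With a plain dict, .get() gives the same fallback without inserting anything.
positions_via_get = plain.get(42, [])
```

Whether the silent insertion on lookup is desirable depends on how the result is used afterwards, so a plain dict with `.get()` may be the safer default.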
Finally, I should note that these timings are data sensitive. When I used only 10 distinct values, I got:
- get_values_and_indicies: 4.34 ms per loop
- data_to_idxlists_unique: 4.42 ms per loop
- by_pand2: 4.83 ms per loop
- data_to_idxlists: 6.09 ms per loop
- by_pand1: 9.39 ms per loop
- by_dd: 22.4 ms per loop
and if I used 10,000 different values, I got:
- get_values_and_indicies: 7.00 ms per loop
- data_to_idxlists_unique: 14.8 ms per loop
- by_dd: 29.8 ms per loop
- by_pand2: 47.7 ms per loop
- by_pand1: 67.3 ms per loop
- data_to_idxlists: 869 ms per loop
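If you want to check how sensitive a given method is to the number of distinct values on your own machine, a harness along these lines may help. It is only a sketch: `time_by_cardinality` and `group_with_defaultdict` are hypothetical names introduced here, the baseline is a generic defaultdict loop rather than any of the functions timed above, and it uses the standard `timeit` module instead of IPython's `%timeit`.

```python
import timeit
from collections import defaultdict

import numpy as np


def group_with_defaultdict(values):
    """Plain-Python baseline: append each position to its value's list."""
    groups = defaultdict(list)
    for position, value in enumerate(values.tolist()):
        groups[value].append(position)
    return groups


def time_by_cardinality(func, cardinalities=(10, 100, 10_000), size=10**5, repeats=3):
    """Return {upper bound on distinct values: best seconds per call}."""
    rng = np.random.default_rng(0)  # fixed seed so runs are comparable
    results = {}
    for n_distinct in cardinalities:
        # Integers drawn from [0, n_distinct), so at most n_distinct distinct values.
        data = rng.integers(0, n_distinct, size=size)
        # Best of several single runs, similar in spirit to %timeit's "best of 3".
        results[n_distinct] = min(
            timeit.repeat(lambda: func(data), number=1, repeat=repeats)
        )
    return results


if __name__ == "__main__":
    for n_distinct, seconds in time_by_cardinality(group_with_defaultdict).items():
        print(f"{n_distinct:>6} distinct values: {seconds * 1e3:.2f} ms per loop")
```

Passing any other grouping function as `func` lets you compare methods under the same data, which matters here because the ranking changes with the number of distinct values.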