I am looking for ways to speed up (or replace) my data grouping algorithm.
I have a list of numpy arrays. I want to create a new numpy array such that its element is the same at any two indices where the source arrays all hold the same combination of values, and different where they don't.
That sounds awkward, so here's an example:
```python
# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]

# Expected outcome:
np.array([0, 1, 2, 3, 0, 3, 4])
#         *           *
```
Note that the marked elements of the expected result (indices 0 and 4) have the same value (0) because the source arrays hold the same combination at both indices (namely 10 and 21). Similarly, indices 3 and 5 share the value 3.
The algorithm must handle an arbitrary number of (equal-length) input arrays, and it must also return, for each resulting group number, the values of the original arrays that it corresponds to. (So, for this example, 3 refers to (11, 22).)
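Written out in full, the mapping for this example would be something like the following (the exact container type is not important to me):

```python
# Group id -> the combination of source values it stands for:
{0: (10, 21), 1: (11, 21), 2: (10, 22), 3: (11, 22), 4: (10, 23)}
```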
Here is my current algorithm:
```python
import numpy as np

def groupify(values):
    # Magic number: -1 means ungrouped.
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1
    group_meanings = {}
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        this_combo = {}

        # Restrict the search to indices that are still ungrouped,
        # and take the first of them as the combination to match.
        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]

        # Narrow 'matching' down to the indices where every input
        # array agrees with the values at first_ungrouped_idx.
        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
            this_combo[curr_id] = needed_value

        # Assign all of the found elements to a new group
        group[matching] = next_hash
        group_meanings[next_hash] = this_combo
        next_hash += 1

    return group, group_meanings
```
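Running this on the test values above produces the expected grouping (the meanings are keyed by the position of each input array):

```python
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
group, group_meanings = groupify(values)
print(group)           # [0 1 2 3 0 3 4]
print(group_meanings)  # {0: {0: 10, 1: 21}, 1: {0: 11, 1: 21}, 2: {0: 10, 1: 22},
                       #  3: {0: 11, 1: 22}, 4: {0: 10, 1: 23}}
```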
Note that the expression value_array[matching] == needed_value is re-evaluated many times for each individual index (once per input array for every group), which is where the slowness originates.
I'm not sure my algorithm can be sped up much further, but I'm also not convinced it's the best approach. Is there a better way to do this?
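For comparison, here is a rough sketch of one alternative I can think of, a hypothetical groupify_unique built on np.unique with the axis keyword (NumPy 1.13+). I haven't verified whether it is actually faster. Note that np.unique assigns labels in sorted order rather than order of first appearance, so the labels differ from my function's while encoding the same grouping:

```python
import numpy as np

def groupify_unique(values):
    # One row per index, one column per input array.
    stacked = np.stack(values, axis=1)
    # Unique value combinations plus the inverse mapping, which is
    # exactly the group array (labels in sorted order, though).
    combos, group = np.unique(stacked, axis=0, return_inverse=True)
    group_meanings = {i: tuple(row) for i, row in enumerate(combos)}
    return group, group_meanings
```

For the example this returns [0, 3, 1, 4, 0, 4, 2] instead of [0, 1, 2, 3, 0, 3, 4], i.e. the same partition under different labels.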