A fast algorithm for finding indices where multiple arrays have the same value

I am looking for ways to speed up (or replace) my data grouping algorithm.

I have a list of numpy arrays. I want to create a new numpy array such that two indices get the same value in the result whenever the source arrays all agree at those indices (and different values otherwise).

That sounds awkward, so here is an example:

    # Test values:
    values = [
        np.array([10, 11, 10, 11, 10, 11, 10]),
        np.array([21, 21, 22, 22, 21, 22, 23]),
    ]

    # Expected outcome:
    np.array([0, 1, 2, 3, 0, 3, 4])
    #         *           *

Note that the marked elements (indices 0 and 4) of the expected result have the same value (0), because the original two arrays held the same pair of values there (namely 10 and 21). Similarly, the elements at indices 3 and 5 share the value 3.

The algorithm must handle an arbitrary number of (equally long) input arrays, and also return, for each resulting group number, the values of the original arrays to which it corresponds. (So, for this example, "3" refers to (11, 22).)
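To make the required second output concrete as well, here is a minimal sketch of both results I am after for the test data above; the exact dictionary layout is just for illustration:

    import numpy as np

    values = [
        np.array([10, 11, 10, 11, 10, 11, 10]),
        np.array([21, 21, 22, 22, 21, 22, 23]),
    ]

    # Desired group labels, one per index:
    expected_group = np.array([0, 1, 2, 3, 0, 3, 4])

    # Desired mapping from each group ID back to the source values
    # (keyed by the position of the array in `values`), e.g. group 3
    # corresponds to (11, 22):
    expected_meanings = {
        0: {0: 10, 1: 21},
        1: {0: 11, 1: 21},
        2: {0: 10, 1: 22},
        3: {0: 11, 1: 22},
        4: {0: 10, 1: 23},
    }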

Here is my current algorithm:

    import numpy as np

    def groupify(values):
        group = np.zeros((len(values[0]),), dtype=np.int64) - 1  # Magic number: -1 means ungrouped.
        group_meanings = {}
        next_hash = 0
        matching = np.ones((len(values[0]),), dtype=bool)
        while any(group == -1):
            this_combo = {}

            matching[:] = (group == -1)
            first_ungrouped_idx = np.where(matching)[0][0]

            for curr_id, value_array in enumerate(values):
                needed_value = value_array[first_ungrouped_idx]
                matching[matching] = value_array[matching] == needed_value
                this_combo[curr_id] = needed_value

            # Assign all of the found elements to a new group
            group[matching] = next_hash
            group_meanings[next_hash] = this_combo
            next_hash += 1

        return group, group_meanings

Note that the expression value_array[matching] == needed_value is evaluated many times for each individual index, which is where the slowness comes from.

I'm not sure my algorithm can be sped up much more, but I'm also not sure it is the best approach. Is there a better way to do this?

4 answers

Finally cracked a vectorized solution! This was an interesting problem. We need to tag each pair of values taken from the corresponding elements of the arrays in the list, and then label each such pair based on its uniqueness among the other pairs. For that we can use np.unique, abusing all of its optional arguments, and finally do some extra work to keep the output in order of first appearance. The implementation essentially runs in three stages -

    # Stack as a 2D array with each pair from values as a column each.
    # Convert to linear index equivalent considering each column as an indexing tuple
    arr = np.vstack(values)
    idx = np.ravel_multi_index(arr, arr.max(1)+1)

    # Do the heavy work with np.unique to give us :
    # 1. Starting indices of unique elems,
    # 2. Array that has unique IDs for each element in idx, and
    # 3. Group ID counts
    _, unq_start_idx, unqID, count = np.unique(idx, return_index=True,
                                               return_inverse=True, return_counts=True)

    # Best part happens here : Use mask to ignore the repeated elems and re-tag
    # each unqID using argsort() of masked elements from idx
    mask = ~np.in1d(unqID, np.where(count > 1)[0])
    mask[unq_start_idx] = 1
    out = idx[mask].argsort()[unqID]
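A worked trace on the question's test data may help follow the three stages; the intermediate values below were worked out by hand, so treat them as illustrative:

    import numpy as np

    values = [
        np.array([10, 11, 10, 11, 10, 11, 10]),
        np.array([21, 21, 22, 22, 21, 22, 23]),
    ]

    arr = np.vstack(values)
    idx = np.ravel_multi_index(arr, arr.max(1) + 1)
    # idx encodes each (row0, row1) pair as a single integer:
    # [261, 285, 262, 286, 261, 286, 263]

    _, unq_start_idx, unqID, count = np.unique(
        idx, return_index=True, return_inverse=True, return_counts=True)
    # unqID labels each pair by its rank in sorted order: [0, 3, 1, 4, 0, 4, 2]
    # ...which groups correctly but numbers the groups in sorted-value order.

    mask = ~np.in1d(unqID, np.where(count > 1)[0])
    mask[unq_start_idx] = True
    out = idx[mask].argsort()[unqID]
    # out re-numbers the groups in order of first appearance: [0, 1, 2, 3, 0, 3, 4]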

Runtime test

Let's compare the proposed vectorized approach against the original code. Since the proposed code only gives us the group IDs, for a fair comparison let's trim the parts of the original code that are not used for that. Here are the function definitions -

    def groupify(values):  # Original code
        group = np.zeros((len(values[0]),), dtype=np.int64) - 1
        next_hash = 0
        matching = np.ones((len(values[0]),), dtype=bool)
        while any(group == -1):

            matching[:] = (group == -1)
            first_ungrouped_idx = np.where(matching)[0][0]

            for curr_id, value_array in enumerate(values):
                needed_value = value_array[first_ungrouped_idx]
                matching[matching] = value_array[matching] == needed_value

            # Assign all of the found elements to a new group
            group[matching] = next_hash
            next_hash += 1
        return group

    def groupify_vectorized(values):  # Proposed code
        arr = np.vstack(values)
        idx = np.ravel_multi_index(arr, arr.max(1)+1)
        _, unq_start_idx, unqID, count = np.unique(idx, return_index=True,
                                                   return_inverse=True, return_counts=True)
        mask = ~np.in1d(unqID, np.where(count > 1)[0])
        mask[unq_start_idx] = 1
        return idx[mask].argsort()[unqID]

Timing results with large arrays -

    In [345]: # Input list with random elements
         ...: values = [item for item in np.random.randint(10,40,(10,10000))]

    In [346]: np.allclose(groupify(values), groupify_vectorized(values))
    Out[346]: True

    In [347]: %timeit groupify(values)
    1 loops, best of 3: 4.02 s per loop

    In [348]: %timeit groupify_vectorized(values)
    100 loops, best of 3: 3.74 ms per loop

This should work, and should be significantly faster, since we use broadcasting and numpy's natively fast logical comparisons:

    import numpy as np

    # Test values:
    values = [
        np.array([10, 11, 10, 11, 10, 11, 10]),
        np.array([21, 21, 22, 22, 21, 22, 23]),
    ]
    # Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])

    # for every value in values, check where duplicate values occur
    same_mask = [val[:, np.newaxis] == val[np.newaxis, :] for val in values]

    # get the conjunction of all those tests
    conjunction = np.logical_and.reduce(same_mask)

    # ignore the diagonal
    conjunction[np.diag_indices_from(conjunction)] = False

    # initialize the labelled array with nans (used as flag)
    labelled = np.empty(values[0].shape)
    labelled.fill(np.nan)

    # keep track of labelled value
    val = 0
    for k, row in enumerate(conjunction):
        if np.isnan(labelled[k]):  # this element has not been labelled yet
            labelled[k] = val      # so label it
            labelled[row] = val    # and label every element satisfying the test
            val += 1

    print(labelled)
    # outputs [ 0.  1.  2.  3.  0.  3.  4.]
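As a side note on how the first step works (and why memory grows quadratically with array length), here is a small sketch of the pairwise mask that broadcasting builds, using the same test data:

    val = values[0]  # array([10, 11, 10, 11, 10, 11, 10])

    # Broadcasting an (n, 1) view against a (1, n) view yields an (n, n)
    # boolean matrix whose (i, j) entry is True iff val[i] == val[j].
    pairwise = val[:, np.newaxis] == val[np.newaxis, :]

    print(pairwise.shape)  # (7, 7)
    print(pairwise[0])     # [ True False  True False  True False  True]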

This is about 1.5 times faster than your version when working with two arrays, but I suspect the speedup should be better with more arrays.


The numpy_indexed package (disclaimer: I am its author) contains generalized variants of the numpy array set operations, which can be used to solve your problem in an elegant and efficient (vectorized) way:

    import numpy_indexed as npi
    unique_values, labels = npi.unique(tuple(values), return_inverse=True)

The above will work for arbitrary combinations of types, but the alternative below will be even more efficient if values is a list of many arrays of the same dtype:

 unique_values, labels = npi.unique(np.asarray(values), axis=1, return_inverse=True) 
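A minimal usage sketch on the question's test data; the exact label numbering and the layout of unique_values depend on the package, so treat the comments as illustrative rather than verified output:

    import numpy as np
    import numpy_indexed as npi

    values = [
        np.array([10, 11, 10, 11, 10, 11, 10]),
        np.array([21, 21, 22, 22, 21, 22, 23]),
    ]

    unique_values, labels = npi.unique(tuple(values), return_inverse=True)

    # labels gives one group ID per index; the grouping matches the question's
    # expected output, although the IDs themselves may be numbered differently.
    print(labels)

    # unique_values holds the (value-from-array-0, value-from-array-1) combination
    # behind each label, which also covers the group_meanings requirement.
    print(unique_values)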

If I understand correctly, you are trying to hash the values column by column. It is best to convert the columns to hashable values and then hash those.

So, you really want a hash on list(np.array(values).T) .
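To make that concrete, here is what the transposed test data looks like; each row is one column-wise combination from the original arrays:

    >>> np.array(values).T
    array([[10, 21],
           [11, 21],
           [10, 22],
           [11, 22],
           [10, 21],
           [11, 22],
           [10, 23]])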

This functionality is already built into Pandas, so you do not need to write it yourself. The only problem is that it expects a flat list of values without nested lists inside it. In this case, you can simply convert each inner array to a string with map(str, list(np.array(values).T)) and factorize that!

    >>> import pandas as pd
    >>> pd.factorize(map(str, list(np.array(values).T)))
    (array([0, 1, 2, 3, 0, 3, 4]),
     array(['[10 21]', '[11 21]', '[10 22]', '[11 22]', '[10 23]'], dtype=object))

I converted the list of arrays to a 2D array, and then each column to a string ...
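One caveat, assuming Python 3 and a recent pandas: there, map returns an iterator rather than a list, so you will likely need to materialize it before passing it to pd.factorize. A sketch:

    import numpy as np
    import pandas as pd

    codes, uniques = pd.factorize(list(map(str, np.array(values).T)))
    print(codes)    # expected to match the question's grouping: [0 1 2 3 0 3 4]
    print(uniques)  # the string form of each unique column, e.g. '[11 22]'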

