What is a faster way to get the location of unique strings in numpy

I have a list of unique strings and another large data array (e.g. test_rows). I was wondering if there is a faster way to get the location of each unique row in the data. The fastest way I could come up with is ...

    import numpy

    uniq_rows = numpy.array([[0, 1, 0],
                             [1, 1, 0],
                             [1, 1, 1],
                             [0, 1, 1]])
    test_rows = numpy.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                             [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                             [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                             [1, 1, 0]])

    # this gives me the indexes of each group of unique rows
    for row in uniq_rows.tolist():
        print(row, numpy.where((test_rows == row).all(axis=1))[0])

Will print ...

    [0, 1, 0] [ 1  4 10]
    [1, 1, 0] [ 3  8 12]
    [1, 1, 1] [7 9]
    [0, 1, 1] [0 5 6]

Is there a more or less numpythonic (not sure if this word exists) way to do this? I searched for a numpy group function but could not find one. Basically, for any incoming dataset, I need the fastest way to get the location of each unique row in this dataset. An incoming dataset will not always contain every unique row, nor will it always have the same number of rows.

EDIT: This is a simple example. In my application, the numbers will not only be zeros and ones; they can range from 0 to 32000. uniq_rows can be 4 to 128 rows long, and test_rows can contain hundreds of thousands of rows.
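For concreteness, an input in that range could be simulated with something like the following (purely illustrative; the names and shapes are made up, not my real data):

    import numpy
    rng = numpy.random.RandomState(0)
    big_uniq = rng.randint(0, 32000, size=(128, 3))
    big_test = big_uniq[rng.randint(0, len(big_uniq), size=300000)]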

python numpy scipy
7 answers

Numpy

As of version 1.13 of numpy, you can use numpy.unique with an axis argument, e.g. np.unique(test_rows, return_counts=True, return_index=True, axis=0) (axis=0 so that whole rows are treated as the elements).
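A minimal sketch of how the pieces fit together (assuming NumPy 1.13+; the variable names here are only illustrative):

    import numpy as np

    # unique rows, first-occurrence indices, and an inverse map from each
    # test row back to the unique row it equals
    uniq, first_idx, inverse = np.unique(test_rows, return_index=True,
                                         return_inverse=True, axis=0)
    # group the positions in test_rows by the unique row they map to
    locations = [np.flatnonzero(inverse == k) for k in range(len(uniq))]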

Pandas

    import pandas as pd

    df = pd.DataFrame(test_rows)
    uniq = pd.DataFrame(uniq_rows)

Uniq

       0  1  2
    0  0  1  0
    1  1  1  0
    2  1  1  1
    3  0  1  1

Or you can automatically generate unique rows from an incoming DataFrame

 uniq_generated = df.drop_duplicates().reset_index(drop=True) 

gives

       0  1  2
    0  0  1  1
    1  0  1  0
    2  0  0  0
    3  1  1  0
    4  1  1  1

and then look up the matches:

    d = dict()
    for idx, row in uniq.iterrows():
        d[idx] = df.index[(df == row).all(axis=1)].values

This is about the same as your where method

d

    {0: array([ 1,  4, 10], dtype=int64),
     1: array([ 3,  8, 12], dtype=int64),
     2: array([7, 9], dtype=int64),
     3: array([0, 5, 6], dtype=int64)}
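A related sketch, not spelled out above, is to let pandas do the grouping itself; GroupBy.indices maps each row (as a tuple of its values) to the positional indices where it occurs:

    # sketch: group df by all of its columns and read off the positional indices
    groups = df.groupby(list(df.columns)).indices
    # {(0, 0, 0): array([ 2, 11]), (0, 1, 0): array([ 1,  4, 10]), ...}
    locations = {tuple(row): groups.get(tuple(row), []) for row in uniq_rows.tolist()}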

With np.unique from v1.13 (obtained from the source linked in the latest documentation, https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L112-L247 ):

    In [157]: aset.unique(test_rows, axis=0, return_inverse=True, return_index=True)
    Out[157]:
    (array([[0, 0, 0],
            [0, 1, 0],
            [0, 1, 1],
            [1, 1, 0],
            [1, 1, 1]]),
     array([2, 1, 0, 3, 7], dtype=int32),
     array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32))

    In [158]: a,b,c = _

    In [159]: c
    Out[159]: array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32)

    In [164]: from collections import defaultdict

    In [165]: dd = defaultdict(list)

    In [166]: for i,v in enumerate(c):
         ...:     dd[v].append(i)
         ...:

    In [167]: dd
    Out[167]: defaultdict(list,
                {0: [2, 11], 1: [1, 4, 10], 2: [0, 5, 6], 3: [3, 8, 12], 4: [7, 9]})

or keying the dictionary by the unique rows themselves (as hashable tuples):

    In [170]: dd = defaultdict(list)

    In [171]: for i,v in enumerate(c):
         ...:     dd[tuple(a[v])].append(i)
         ...:

    In [172]: dd
    Out[172]: defaultdict(list,
                {(0, 0, 0): [2, 11],
                 (0, 1, 0): [1, 4, 10],
                 (0, 1, 1): [0, 5, 6],
                 (1, 1, 0): [3, 8, 12],
                 (1, 1, 1): [7, 9]})

Approach #1

Here's one approach; not sure about the level of "NumPythonic-ness" though, for such a tricky problem -

    import numpy as np

    def get1Ds(a, b):  # Get 1D views of each row from the two inputs
        # check that casting to void will create equal size elements
        assert a.shape[1:] == b.shape[1:]
        assert a.dtype == b.dtype

        # compute dtypes
        void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))

        # convert to 1d void arrays
        a = np.ascontiguousarray(a)
        b = np.ascontiguousarray(b)
        a_void = a.reshape(a.shape[0], -1).view(void_dt).ravel()
        b_void = b.reshape(b.shape[0], -1).view(void_dt).ravel()
        return a_void, b_void

    def matching_row_indices(uniq_rows, test_rows):
        A, B = get1Ds(uniq_rows, test_rows)

        validA_mask = np.in1d(A, B)
        sidx_A = A.argsort()
        validA_mask = validA_mask[sidx_A]

        sidx = B.argsort()
        sortedB = B[sidx]
        split_idx = np.flatnonzero(sortedB[1:] != sortedB[:-1]) + 1
        all_split_indx = np.split(sidx, split_idx)

        match_mask = np.in1d(B, A)[sidx]
        valid_mask = np.logical_or.reduceat(match_mask, np.r_[0, split_idx])
        locations = [e for i, e in enumerate(all_split_indx) if valid_mask[i]]
        return uniq_rows[sidx_A[validA_mask]], locations

Scope for improvement (performance), sketched below:

  • np.split can be replaced with a for-loop that splits via slicing.
  • np.r_ can be replaced with np.concatenate.
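A rough sketch of both replacements, reusing the names from matching_row_indices above (intended to behave identically):

    offsets = np.concatenate(([0], split_idx))          # instead of np.r_[0, split_idx]
    bounds = np.concatenate((offsets, [len(sidx)]))
    all_split_indx = [sidx[bounds[i]:bounds[i + 1]]      # slicing instead of np.split
                      for i in range(len(bounds) - 1)]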

Run Example -

    In [331]: unq_rows, idx = matching_row_indices(uniq_rows, test_rows)

    In [332]: unq_rows
    Out[332]:
    array([[0, 1, 0],
           [0, 1, 1],
           [1, 1, 0],
           [1, 1, 1]])

    In [333]: idx
    Out[333]: [array([ 1,  4, 10]), array([0, 5, 6]), array([ 3,  8, 12]), array([7, 9])]

Approach #2

Another approach, which would offset the setup overhead of the previous one and reuse get1Ds from it, would be -

    A, B = get1Ds(uniq_rows, test_rows)
    idx_group = []
    for row in A:
        idx_group.append(np.flatnonzero(B == row))

This will complete the task:

    import numpy as np

    uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
    test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0],
                          [0, 1, 1], [0, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 1],
                          [0, 1, 0], [0, 0, 0], [1, 1, 0]])

    indices = np.where(np.sum(np.abs(np.repeat(uniq_rows, len(test_rows), axis=0)
                                     - np.tile(test_rows, (len(uniq_rows), 1))),
                              axis=1) == 0)[0]
    loc = indices // len(test_rows)
    indices = indices - loc * len(test_rows)
    res = [[] for i in range(len(uniq_rows))]
    for i in range(len(indices)):
        res[loc[i]].append(indices[i])
    print(res)

    [[1, 4, 10], [3, 8, 12], [7, 9], [0, 5, 6]]

This will work for all cases, including cases where not all rows in uniq_rows are present in test_rows . However, if you somehow know that they are all present, you can replace the part

    res = [[] for i in range(len(uniq_rows))]
    for i in range(len(indices)):
        res[loc[i]].append(indices[i])

with just the line:

 res=np.split(indices,np.where(np.diff(loc)>0)[0]+1) 

This avoids explicit loops entirely.


Not very "numpythonic", but for a small initial cost, we can make a dict with keys as a tuple of your string and a list of indices:

    test_rowsdict = {}
    for i, j in enumerate(test_rows):
        test_rowsdict.setdefault(tuple(j), []).append(i)

    test_rowsdict
    {(0, 0, 0): [2, 11],
     (0, 1, 0): [1, 4, 10],
     (0, 1, 1): [0, 5, 6],
     (1, 1, 0): [3, 8, 12],
     (1, 1, 1): [7, 9]}

Then you can filter based on your uniq_rows with a fast dict lookup, test_rowsdict[tuple(row)]:

    out = []
    for i in uniq_rows:
        out.append((i, test_rowsdict.get(tuple(i), [])))

For your data, I get 16µs for the lookup alone and 66µs for build plus lookup, compared to 95µs for your np.where solution.
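The exact numbers will vary by machine, but a comparison along those lines could be reproduced with something like this (sketch only; it assumes the arrays, the test_rowsdict built above, and the numpy import from the question are in scope):

    import timeit

    def build_and_search():
        d = {}
        for i, j in enumerate(test_rows):
            d.setdefault(tuple(j), []).append(i)
        return [(i, d.get(tuple(i), [])) for i in uniq_rows]

    n = 10000
    t_lookup = timeit.timeit(
        lambda: [(i, test_rowsdict.get(tuple(i), [])) for i in uniq_rows], number=n)
    t_build = timeit.timeit(build_and_search, number=n)
    t_where = timeit.timeit(
        lambda: [numpy.where((test_rows == r).all(axis=1))[0] for r in uniq_rows.tolist()],
        number=n)
    print(t_lookup / n, t_build / n, t_where / n)  # seconds per run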


The numpy_indexed package (disclaimer: I am the author) was created to solve such problems in an elegant and efficient way:

    import numpy as np
    import numpy_indexed as npi

    indices = np.arange(len(test_rows))
    unique_test_rows, index_groups = npi.group_by(test_rows, indices)

If you do not need the indices of all the rows, but only of those that are present in test_rows, npi has a bunch of simple ways of tackling that problem too; for instance:

    subset_indices = npi.indices(unique_test_rows, uniq_rows)

As a side note, it might be useful to take a look at the examples in the npi library; in my experience, most of the time people ask a question like this, these grouped indices are just a means to an end and not the end goal of the computation. Most likely, using the functionality in npi, you can reach that goal more efficiently without ever computing those indices explicitly. Care to give some more background on your problem?

EDIT: if your arrays really are that large and always consist of a small number of columns with binary values, packing them with the following encoding can boost efficiency considerably:

    def encode(rows):
        return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)
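A possible way to use it (not shown above; it assumes the rows really are binary and have at most 8 columns, because of the uint8 accumulator):

    # compare packed 1-D codes instead of full 2-D rows
    test_codes = encode(test_rows)   # one small integer per row
    uniq_codes = encode(uniq_rows)
    locations = [np.flatnonzero(test_codes == c) for c in uniq_codes]
    # [array([ 1,  4, 10]), array([ 3,  8, 12]), array([7, 9]), array([0, 5, 6])]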

There are many solutions here, but I am adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although array broadcasting can cause memory problems when large arrays are used.

 np.where((uniq_rows[:, None, :] == test_rows).all(2)) 

Wonderfully simple, huh? This returns a tuple of uniq_rows indices and the corresponding test_rows indices.

    (array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
     array([ 1,  4, 10,  3,  8, 12,  7,  9,  0,  5,  6]))

How it works:

 (uniq_rows[:, None, :] == test_rows) 

uses array broadcasting to compare every row of test_rows with every row of uniq_rows, which results in a 4x13x3 array. all(2) is then used to determine which rows are fully equal (all comparisons along the last axis are True). Finally, where returns the indices of those rows.
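If the per-unique-row grouping from the question is needed, the two index arrays can be pulled apart afterwards; a small sketch of one way to do that:

    u_idx, t_idx = np.where((uniq_rows[:, None, :] == test_rows).all(2))
    # one array of test_rows positions per row of uniq_rows
    locations = [t_idx[u_idx == k] for k in range(len(uniq_rows))]
    # [array([ 1,  4, 10]), array([ 3,  8, 12]), array([7, 9]), array([0, 5, 6])]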

