Indexes that intersect and sort two numpy arrays

I have two numpy arrays of integers whose length is several hundred million. Inside each array, the values ​​are unique, and each of them is unsorted at first.

I would like the indexes to correspond to each ordered intersection. For example:

x = np.array([4, 1, 10, 5, 8, 13, 11]) y = np.array([20, 5, 4, 9, 11, 7, 25]) 

Then their sorted intersection [4, 5, 11] , and therefore we want the indices to turn each of x and y into this array, so we want it to be returned:

 mx = np.array([0, 3, 6]) my = np.array([2, 1, 4]) 

since then x[mx] == y[my] == np.intersect1d(x, y)

The only solution we have so far includes three different arguments, so it seems that this is unlikely to be optimal.

Each value represents a galaxy, in case the problem becomes more exciting.

+8
optimization python sorting numpy set-intersection
source share
4 answers

Here's an option based on the implementation of intersect1d , which is pretty simple. This requires a single call to argsort .

Admittedly, a simplified passage passes.

 import numpy as np def my_intersect(x, y): """my_intersect(x, y) -> xm, ym x, y: 1-d arrays of unique values xm, ym: indices into x and y giving sorted intersection """ # basic idea taken from numpy.lib.arraysetops.intersect1d aux = np.concatenate((x, y)) sidx = aux.argsort() # Note: intersect1d uses aux[:-1][aux[1:]==aux[:-1]] here - I don't know why the first [:-1] is necessary inidx = aux[sidx[1:]] == aux[sidx[:-1]] # quicksort is not stable, so must do some work to extract indices # (if stable, sidx[inidx.nonzero()] would be for x) # interlace the two sets of indices, and check against lengths xym = np.vstack((sidx[inidx.nonzero()], sidx[1:][inidx.nonzero()])).T.flatten() xm = xym[xym < len(x)] ym = xym[xym >= len(x)] - len(x) return xm, ym def check_my_intersect(x, y): mx, my = my_intersect(x, y) assert (x[mx] == np.intersect1d(x, y)).all() # not really necessary: np.intersect1d returns a sorted list assert (x[mx] == sorted(x[mx])).all() assert (x[mx] == y[my]).all() def random_unique_unsorted(n): while True: x = np.unique(np.random.randint(2*n, size=n)) if len(x): break np.random.shuffle(x) return x x = np.array([4, 1, 10, 5, 8, 13, 11]) y = np.array([20, 5, 4, 9, 11, 7, 25]) check_my_intersect(x, y) for i in range(20): x = random_unique_unsorted(100+i) y = random_unique_unsorted(200+i) check_my_intersect(x, y) 

Edit: comment "Note" was confusing (used ... as ellipsis of speech, forgot that this is also a Python operator).

+3
source share

For a clean numpy solution, you can do something like this:

  • Use np.unique to get unique values ​​and corresponding indices in x and y separately:

     # sorted unique values in x and y and the indices corresponding to their first # occurrences, such that u_x == x[u_idx_x] u_x, u_idx_x = np.unique(x, return_index=True) u_y, u_idx_y = np.unique(y, return_index=True) 
  • Find the intersection of unique values ​​with np.intersect1d :

     # we can assume_unique, which can be faster for large arrays i_xy = np.intersect1d(u_x, u_y, assume_unique=True) 
  • Finally, use np.in1d to select only indexes matching unique values ​​in x or y , which also appear at the intersection of x and y :

     # it is also safe to assume_unique here i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)] i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)] 

To combine all this into one function:

 def intersect_indices(x, y): u_x, u_idx_x = np.unique(x, return_index=True) u_y, u_idx_y = np.unique(y, return_index=True) i_xy = np.intersect1d(u_x, u_y, assume_unique=True) i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)] i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)] return i_idx_x, i_idx_y 

For example:

 x = np.array([4, 1, 10, 5, 8, 13, 11]) y = np.array([20, 5, 4, 9, 11, 7, 25]) i_idx_x, i_idx_y = intersect_indices(x, y) print(i_idx_x, i_idx_y) # (array([0, 3, 6]), array([2, 1, 4])) 

Speed ​​test:

 In [1]: k = 1000000 In [2]: %%timeit x, y = np.random.randint(k, size=(2, k)) intersect_indices(x, y) ....: 1 loops, best of 3: 597 ms per loop 

Update:

At first I skipped the fact that in your case both x and y contain only unique values. Given this, you can do a little better using the indirect view:

 def intersect_indices_unique(x, y): u_idx_x = np.argsort(x) u_idx_y = np.argsort(y) i_xy = np.intersect1d(x, y, assume_unique=True) i_idx_x = u_idx_x[x[u_idx_x].searchsorted(i_xy)] i_idx_y = u_idx_y[y[u_idx_y].searchsorted(i_xy)] return i_idx_x, i_idx_y 

Here's a more realistic test case, where x and y both contain unique (but partially overlapping) values:

 In [1]: n, k = 10000000, 1000000 In [2]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2)) intersect_indices(x, y) ....: 1 loops, best of 3: 593 ms per loop In [3]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2)) intersect_indices_unique(x, y) ....: 1 loops, best of 3: 453 ms per loop 

@Divakar's solution is very similar to performance:

 In [4]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2)) searchsorted_based(x, y) ....: 1 loops, best of 3: 472 ms per loop 
+3
source share

You can also use np.searchsorted , for example:

 def searchsorted_based(x,y): # Get argsort for both x and y xsort_idx = x.argsort() ysort_idx = y.argsort() # Sort x and y and store them X = x[xsort_idx] Y = y[ysort_idx] # Find positions of Y in X and the matches by the positions that # shift between 'left' and 'right' based searches. # Use the matches posotions to get corresponding argsort for X. x1 = np.searchsorted(X,Y,'left') x2 = np.searchsorted(X,Y,'right') out1 = xsort_idx[x1[x2 != x1]] # Repeat for X in Y findings y1 = np.searchsorted(Y,X,'left') y2 = np.searchsorted(Y,X,'right') out2 = ysort_idx[y1[y2 != y1]] return out1, out2 

Run Example -

 In [100]: x = np.array([4, 1, 10, 5, 8, 13, 11]) ...: y = np.array([20, 5, 4, 9, 11, 7, 25]) ...: In [101]: searchsorted_based(x,y) Out[101]: (array([0, 3, 6]), array([2, 1, 4])) 
+3
source share

Pure Python solutions using dict may work for you:

 def indices_from_values(a, intersect): idx = {value: index for index, value in enumerate(a)} return np.array([idx[x] for x in intersect]) intersect = np.intersect1d(x, y) mx = indices_from_values(x, intersect) my = indices_from_values(y, intersect) np.allclose(x[mx], y[my]) and np.allclose(x[mx], np.intersect1d(x, y)) 
0
source share

All Articles