Efficient Python implementation for comparing numpy arrays

Background

I have two numpy arrays that I would like to use to perform some comparison operations in the most efficient / fastest way. Both contain only unsigned ints.

pairs is an n x 2 x 3 array that contains a long list of paired three-dimensional coordinates - that is:

    # full pairs array
    In [145]: pairs
    Out[145]:
    array([[[1, 2, 4],
            [3, 4, 4]],
           .....
           [[1, 2, 5],
            [5, 6, 5]]])

    # each entry contains a pair of 3D coordinates
    In [149]: pairs[0]
    Out[149]:
    array([[1, 2, 4],
           [3, 4, 4]])

positions is an n x 3 array that contains a set of three-dimensional coordinates:

    In [162]: positions
    Out[162]:
    array([[ 1,  2,  4],
           [ 3,  4,  5],
           [ 5,  6,  3],
           [ 3,  5,  6],
           [ 6,  7,  5],
           [12,  2,  5]])

Purpose

I want to create an array that is a subset of the pairs array, containing ONLY the records in which at most one member of each pair appears in the positions array, i.e. there should be no pairs where BOTH members lie in the positions array. As some domain information: each pair will have at least one of its two coordinates within the list of positions.

Approaches taken so far

My initial naive approach was to iterate over each pair in the pairs array, subtract each of the two pair positions from the positions array, and check whether BOTH subtractions produced an all-zero row, which would indicate a match:

    if (~(positions - pair[0]).any(axis=1)).any() and \
       (~(positions - pair[1]).any(axis=1)).any():
        # both members of the pair were in the positions array -
        # these weren't the droids we were looking for
        pass
    else:
        # append this pair to a new matrix
        ...

This works and makes some use of vectorization, but is it really the best way to do this?
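For reference, here is a minimal, self-contained version of that loop as a runnable sketch (the toy data and the helper name filter_pairs_loop are mine, not from the original code):

```python
import numpy as np

pairs = np.array([[[1, 2, 4], [3, 4, 4]],    # only the first member is in positions
                  [[1, 2, 4], [3, 4, 5]]])   # BOTH members are in positions
positions = np.array([[1, 2, 4], [3, 4, 5], [5, 6, 3],
                      [3, 5, 6], [6, 7, 5], [12, 2, 5]])

def filter_pairs_loop(pairs, positions):
    kept = []
    for pair in pairs:
        # an all-zero row after subtraction means an exact coordinate match
        in0 = (~(positions - pair[0]).any(axis=1)).any()
        in1 = (~(positions - pair[1]).any(axis=1)).any()
        if not (in0 and in1):
            # at most one member matched, so keep this pair
            kept.append(pair)
    return np.array(kept)

print(filter_pairs_loop(pairs, positions))
```

Only the first pair survives; the second has both members in positions and is dropped.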

For some other performance-sensitive components of this program I have rewritten things in Cython, which led to significant speedups, although in this case (at least with a naive nested-loop implementation) it was slightly slower than the approach described above.

If people have suggestions, I am happy to profile and report back (I have a full profiling infrastructure in place).

performance python numpy cython
2 answers

Approach #1

As mentioned in the question, both arrays contain only unsigned ints, which can be exploited to collapse each XYZ triplet into an equivalent linear index that is unique for each unique triplet. The implementation would look something like this:

    maxlen = np.max(pairs, axis=(0,1))
    dims = np.append(maxlen[::-1][:-1].cumprod()[::-1], 1)
    pairs1D = np.dot(pairs.reshape(-1,3), dims)
    positions1D = np.dot(positions, dims)
    mask_idx = ~(np.in1d(pairs1D, positions1D).reshape(-1,2).all(1))
    out = pairs[mask_idx]
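To see what the linearization does, here is a sketch on toy data, with one caveat: for the mapping to be guaranteed collision-free, the strides should be built from the per-coordinate maximum plus one (a slight variation on the dims above), so that the largest coordinate value still fits inside its "digit":

```python
import numpy as np

pairs = np.array([[[1, 2, 4], [3, 4, 4]],
                  [[1, 2, 4], [3, 4, 5]]])
positions = np.array([[1, 2, 4], [3, 4, 5]])

# strides based on max+1 guarantee a unique integer per triplet
maxlen = np.max(pairs, axis=(0, 1)) + 1                  # [4, 5, 6]
dims = np.append(maxlen[::-1][:-1].cumprod()[::-1], 1)   # [30, 6, 1]
pairs1D = np.dot(pairs.reshape(-1, 3), dims)             # one int per member
positions1D = np.dot(positions, dims)

# keep a pair unless BOTH of its linearized members occur in positions1D
mask_idx = ~(np.in1d(pairs1D, positions1D).reshape(-1, 2).all(1))
print(pairs[mask_idx])
```

Here pairs1D becomes [46, 118, 46, 119] and positions1D becomes [46, 119], so only the first pair survives.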

Since you are dealing with 3D coordinates, you could also use cdist to check for identical XYZ triplets between the input arrays. Two implementations based on this idea are listed below.

Approach #2

    from scipy.spatial.distance import cdist
    p0 = cdist(pairs[:,0,:], positions)
    p1 = cdist(pairs[:,1,:], positions)
    out = pairs[((p0==0) | (p1==0)).sum(1) != 2]

Approach #3

    mask_idx = ~((cdist(pairs.reshape(-1,3), positions)==0).any(1).reshape(-1,2).all(1))
    out = pairs[mask_idx]
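On toy data, the cdist-based test can be checked end to end like this (a sketch assuming SciPy is available; the coordinates are exact integers, so a zero distance means an exact match):

```python
import numpy as np
from scipy.spatial.distance import cdist

pairs = np.array([[[1, 2, 4], [3, 4, 4]],
                  [[1, 2, 4], [3, 4, 5]]])
positions = np.array([[1, 2, 4], [3, 4, 5]])

# Euclidean distance from every flattened pair member to every position;
# a zero entry marks an exact coordinate match
d = cdist(pairs.reshape(-1, 3).astype(float), positions.astype(float))
member_matched = (d == 0).any(1)                     # per member
both_matched = member_matched.reshape(-1, 2).all(1)  # per pair
print(pairs[~both_matched])
```

Again only the first pair survives, since only one of its members has a zero-distance match.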

Runtime Tests -

    In [80]: n = 5000
        ...: pairs = np.random.randint(0,100,(n,2,3))
        ...: positions = np.random.randint(0,100,(n,3))

    In [81]: def cdist_split(pairs,positions):
        ...:     p0 = cdist(pairs[:,0,:],positions)
        ...:     p1 = cdist(pairs[:,1,:],positions)
        ...:     return pairs[((p0==0) | (p1==0)).sum(1)!=2]
        ...:
        ...: def cdist_merged(pairs,positions):
        ...:     mask_idx = ~((cdist(pairs.reshape(-1,3),positions)==0).any(1).reshape(-1,2).all(1))
        ...:     return pairs[mask_idx]
        ...:
        ...: def XYZ_merged(pairs,positions):
        ...:     maxlen = np.max(pairs,axis=(0,1))
        ...:     dims = np.append(maxlen[::-1][:-1].cumprod()[::-1],1)
        ...:     pairs1D = np.dot(pairs.reshape(-1,3),dims)
        ...:     positions1D = np.dot(positions,dims)
        ...:     mask_idx1 = ~(np.in1d(pairs1D,positions1D).reshape(-1,2).all(1))
        ...:     return pairs[mask_idx1]

    In [82]: %timeit cdist_split(pairs,positions)
    1 loops, best of 3: 662 ms per loop

    In [83]: %timeit cdist_merged(pairs,positions)
    1 loops, best of 3: 615 ms per loop

    In [84]: %timeit XYZ_merged(pairs,positions)
    100 loops, best of 3: 4.02 ms per loop

Verify results -

    In [85]: np.allclose(cdist_split(pairs,positions),cdist_merged(pairs,positions))
    Out[85]: True

    In [86]: np.allclose(cdist_split(pairs,positions),XYZ_merged(pairs,positions))
    Out[86]: True

Elaborating on my comment:

First, expand pairs to make the example more interesting (feel free to test with a larger, more realistic array):

    In [260]: pairs = np.array([[[1,2,4],[3,4,4]],
                                [[1,2,5],[5,6,5]],
                                [[3,4,5],[3,5,6]],
                                [[6,7,5],[1,2,3]]])
    In [261]: positions = np.array([[ 1, 2, 4],
                                    [ 3, 4, 5],
                                    [ 5, 6, 3],
                                    [ 3, 5, 6],
                                    [ 6, 7, 5],
                                    [12, 2, 5]])

Expand both arrays into broadcastable forms:

    In [262]: I = pairs[None,...] == positions[:,None,None,:]
    In [263]: I.shape
    Out[263]: (6, 4, 2, 3)

I is a large boolean array comparing the elements across all dimensions. Other comparisons could be substituted for == here (difference == 0, np.isclose for floats, etc.).

    In [264]: J = I.all(axis=-1).any(axis=0).sum(axis=-1)
    In [265]: J
    Out[265]: array([1, 0, 2, 1])

Then consolidate the results along the various dimensions: match all numbers of a coordinate, match against any position, and count the matches within each pair.

    In [266]: pairs[J==1,...]
    Out[266]:
    array([[[1, 2, 4],
            [3, 4, 4]],

           [[6, 7, 5],
            [1, 2, 3]]])

J==1 selects the entries where exactly one member of the pair matches (see the note at the end).

This combination of all, any, and sum works for the test case, but may need adjustment for a larger one. The general idea, though, is broadly applicable.
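Putting those broadcasting steps together into one runnable sketch on the same toy data (using the at-most-one test J < 2 rather than J == 1, per the correction at the end of this answer):

```python
import numpy as np

pairs = np.array([[[1, 2, 4], [3, 4, 4]],
                  [[1, 2, 5], [5, 6, 5]],
                  [[3, 4, 5], [3, 5, 6]],
                  [[6, 7, 5], [1, 2, 3]]])
positions = np.array([[1, 2, 4], [3, 4, 5], [5, 6, 3],
                      [3, 5, 6], [6, 7, 5], [12, 2, 5]])

# shape (6, 4, 2, 3): every position compared against every pair member
I = pairs[None, ...] == positions[:, None, None, :]

# all(-1): full 3-coordinate match; any(0): matched some position;
# sum(-1): number of matched members within each pair
J = I.all(axis=-1).any(axis=0).sum(axis=-1)
print(J)             # [1 0 2 1]
print(pairs[J < 2])  # pairs with at most one member in positions
```

Only the third pair (both members present) is excluded.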


For arrays of the size in fooobar.com/questions/829351 / ... , my solution is rather slow. In particular, it performs the == test across everything at once, which produces an I with shape (5000, 5000, 2, 3).

Compressing the last dimension helps a lot:

    dims = np.array([10000, 100, 1])  # simpler version of dims from XYZ_merged
    pairs1D = np.dot(pairs.reshape(-1,3), dims)
    positions1D = np.dot(positions, dims)
    I1d = pairs1D[:,None] == positions1D[None,:]
    J1d = I1d.any(axis=-1).reshape(pairs.shape[:2]).sum(axis=-1)

I changed the J1d expression to match my earlier one, so that it counts the number of matches per pair.

The in1d approach that Divakar uses is even faster:

    mask = np.in1d(pairs1D, positions1D).reshape(-1,2)
    Jmask = mask.sum(axis=-1)
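A quick runnable check that this in1d count agrees with the broadcast version on the toy data (the hard-coded dims is valid only while every coordinate is below 100):

```python
import numpy as np

pairs = np.array([[[1, 2, 4], [3, 4, 4]],
                  [[1, 2, 5], [5, 6, 5]],
                  [[3, 4, 5], [3, 5, 6]],
                  [[6, 7, 5], [1, 2, 3]]])
positions = np.array([[1, 2, 4], [3, 4, 5], [5, 6, 3],
                      [3, 5, 6], [6, 7, 5], [12, 2, 5]])

dims = np.array([10000, 100, 1])     # assumes all coordinates < 100
pairs1D = np.dot(pairs.reshape(-1, 3), dims)
positions1D = np.dot(positions, dims)

mask = np.in1d(pairs1D, positions1D).reshape(-1, 2)
Jmask = mask.sum(axis=-1)            # matched members per pair
print(Jmask)                         # [1 0 2 1]
print(pairs[Jmask < 2])              # the at-most-one-match subset
```

The counts match the array([1, 0, 2, 1]) produced by the broadcasting approach above.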

I just realized that the OP asks for pairs where at most one member is in the positions array, whereas I test for exactly one match per pair. So my tests above should be changed to pairs[J<2,...].

(In my particular random sample for n = 5000, that turns out to select everything: there are no pairs where both members are in positions. 54 of the 5000 have J==1; the rest are 0, with no matches.)

