Finding closest items in two lists / arrays in Python

I have two numpy arrays, x and y, containing float values. For each value in x I want to find the closest element in y, without reusing elements of y. The result should be a 1-to-1 mapping from indices of x to indices of y. Here's a bad way to do it that relies on sorting. It removes each element that has been paired from the list. Without sorting this would be bad, because the pairing would depend on the order of the original input arrays.

    import operator
    import numpy as np

    def min_i(values):
        # index and value of the smallest element
        min_index, min_value = min(enumerate(values), key=operator.itemgetter(1))
        return min_index, min_value

    # unsorted elements
    unsorted_x = np.random.randn(10) * 10
    unsorted_y = np.random.randn(10) * 10

    # sort arrays
    x = np.sort(unsorted_x)
    y = np.sort(unsorted_y)

    pairs = []
    indx_to_search = list(range(len(y)))

    for x_indx, x_item in enumerate(x):
        if len(indx_to_search) == 0:
            print("ran out of items to match...")
            break
        # take the smallest remaining candidate (both arrays are sorted)
        possible_values = y[indx_to_search]
        nearest_indx, nearest_item = min_i(possible_values)
        orig_indx = indx_to_search[nearest_indx]
        # remove it from the pool
        indx_to_search.remove(orig_indx)
        pairs.append((x_indx, orig_indx))

    print("paired items: ")
    for k, v in pairs:
        print(x[k], " paired with ", y[v])

I'd rather do this without sorting the elements first, but if they are sorted then I want to get the indices into the original, unsorted arrays unsorted_x and unsorted_y (see the sketch below). What's the best way to do this in numpy/scipy/Python, or with pandas? Thanks.
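For the index-recovery part, np.argsort already provides the mapping between sorted and unsorted positions; a minimal sketch:

    import numpy as np

    unsorted_x = np.random.randn(10) * 10
    order = np.argsort(unsorted_x)  # order[i] = position in unsorted_x of the i-th smallest value
    x = unsorted_x[order]           # sorted copy

    # an index i into the sorted array maps back to the original via order[i]
    assert x[3] == unsorted_x[order[3]]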

Edit: to clarify, I'm not trying to find the best fit across all elements (e.g. minimizing the sum of distances), but rather the best fit for each element, and it's okay if that sometimes comes at the expense of other elements. I assume that y is generally much larger than x, contrary to the example above, so there will usually be many good fits for each value of x in y, and I just want to find one efficiently.

Can anyone show an example of using scipy's kd-trees for this? The docs are quite sparse:

    kdtree = scipy.spatial.cKDTree([x, y])
    kdtree.query([-3] * 10)  # ?? unsure about what query takes as arg
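For reference, cKDTree expects its data as an (n_points, n_dims) array, so 1-D values have to be reshaped into a single column, and query returns a (distances, indices) pair. A minimal sketch of the calling convention, with illustrative data (the first answer below builds on this same pattern):

    import numpy as np
    from scipy.spatial import cKDTree

    y = np.random.randn(10) * 10
    tree = cKDTree(y[:, None])               # 1-D data as an (n, 1) column

    x = np.random.randn(10) * 10
    dist, idx = tree.query(x[:, None], k=1)  # nearest y index for each x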
2 answers

EDIT 2: A solution using a KDTree can perform very well if you can choose a number of neighbors k that guarantees a unique neighbor for every element in your array. With the following code:

    import numpy as np
    import scipy.spatial

    def nearest_neighbors_kd_tree(x, y, k):
        x, y = map(np.asarray, (x, y))
        tree = scipy.spatial.cKDTree(y[:, None])
        # indices of the k nearest elements of y for each x, ordered by distance
        ordered_neighbors = tree.query(x[:, None], k)[1]
        nearest_neighbor = np.empty((len(x),), dtype=np.intp)
        nearest_neighbor.fill(-1)
        used_y = set()
        for j, neigh_j in enumerate(ordered_neighbors):
            # take the closest neighbor not yet claimed by an earlier x
            for idx in neigh_j:
                if idx not in used_y:
                    nearest_neighbor[j] = idx
                    used_y.add(idx)
                    break
        return nearest_neighbor

and a sample of n=1000 points (the x and y defined at the end of this answer), I get:

    In [9]: np.any(nearest_neighbors_kd_tree(x, y, 12) == -1)
    Out[9]: True

    In [10]: np.any(nearest_neighbors_kd_tree(x, y, 13) == -1)
    Out[10]: False
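If you would rather not probe k by hand, a hypothetical helper (not in the original answer; it assumes numpy as np and the function above are in scope) can scan upward until no element is left unmatched. Enlarging k only adds fallback candidates, so the first k that works should be the smallest usable one:

    def smallest_usable_k(x, y, k_max=None):
        # probe increasing k until nearest_neighbors_kd_tree leaves no -1 behind;
        # start at 2 because query() returns scalars for k=1, which the
        # function above does not handle
        if k_max is None:
            k_max = len(y)
        for k in range(2, k_max + 1):
            if not np.any(nearest_neighbors_kd_tree(x, y, k) == -1):
                return k
        raise ValueError("no k up to k_max gives every x a unique neighbor")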

So k=13 is optimal, and the timing is then:

    In [11]: %timeit nearest_neighbors_kd_tree(x, y, 13)
    100 loops, best of 3: 9.26 ms per loop

But in the worst case you may need k=1000, and then:

    In [12]: %timeit nearest_neighbors_kd_tree(x, y, 1000)
    1 loops, best of 3: 424 ms per loop

Which is slower than the other options:

    In [13]: %timeit nearest_neighbors(x, y)
    10 loops, best of 3: 60 ms per loop

    In [14]: %timeit nearest_neighbors_sorted(x, y)
    10 loops, best of 3: 47.4 ms per loop

EDIT: Sorting the array before searching pays off for arrays of more than 1000 elements:

    import numpy as np

    def nearest_neighbors_sorted(x, y):
        x, y = map(np.asarray, (x, y))
        y_idx = np.argsort(y)
        y = y[y_idx]  # sorted copy; y_idx maps back to the original positions
        nearest_neighbor = np.empty((len(x),), dtype=np.intp)
        for j, xj in enumerate(x):
            idx = np.searchsorted(y, xj)
            # searchsorted gives the insertion point; the nearest value is
            # either y[idx] or y[idx - 1], whichever is closer
            if idx == len(y) or idx != 0 and y[idx] - xj > xj - y[idx - 1]:
                idx -= 1
            nearest_neighbor[j] = y_idx[idx]
            # remove the claimed element so it cannot be reused
            y = np.delete(y, idx)
            y_idx = np.delete(y_idx, idx)
        return nearest_neighbor

With arrays 10,000 elements long:

    In [2]: %timeit nearest_neighbors_sorted(x, y)
    1 loops, best of 3: 557 ms per loop

    In [3]: %timeit nearest_neighbors(x, y)
    1 loops, best of 3: 1.53 s per loop

For smaller arrays, it is slightly worse.


You are going to have to iterate over all your items to implement a greedy nearest-neighbor algorithm like this, if only to discard duplicates. With that in mind, this is the fastest I could come up with:

    import numpy as np

    def nearest_neighbors(x, y):
        x, y = map(np.asarray, (x, y))
        y = y.copy()
        y_idx = np.arange(len(y))
        nearest_neighbor = np.empty((len(x),), dtype=np.intp)
        for j, xj in enumerate(x):
            # brute-force scan for the closest remaining element of y
            idx = np.argmin(np.abs(y - xj))
            nearest_neighbor[j] = y_idx[idx]
            # remove the claimed element so it cannot be reused
            y = np.delete(y, idx)
            y_idx = np.delete(y_idx, idx)
        return nearest_neighbor

And now with:

    n = 1000
    x = np.random.rand(n)
    y = np.random.rand(2*n)

I get:

    In [11]: %timeit nearest_neighbors(x, y)
    10 loops, best of 3: 52.4 ms per loop
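To connect this back to the 1-to-1 index mapping asked for in the question: entry j of the returned array is the index into y claimed by x[j]. A small usage sketch, assuming the names from the code above:

    nn = nearest_neighbors(x, y)
    for j, i in enumerate(nn):
        print(x[j], "paired with", y[i])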

This simplified code worked fine.

    import numpy as np

    N = 12
    M = 15
    X = [np.random.random() for i in range(N)]
    Y = [np.random.random() for i in range(M)]

    pair = []
    for x in X:
        # distance from x to every remaining element of Y
        t = [abs(x - y) for y in Y]
        ind = t.index(min(t))
        pair.append((x, Y[ind]))
        # remove the matched element of Y so it cannot be reused
        # (do not remove from X while iterating over it)
        Y.remove(Y[ind])
    print(pair)
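Note that, unlike the numpy answers above, this pairs values rather than indices, and it scans all of the remaining Y for every x, so it is O(N*M): fine for lists this small, but it will not scale like the searchsorted or KDTree approaches.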
