Finding closest items in two lists / arrays in Python

I have two numpy arrays, x and y, containing float values. For each value in x I want to find the closest element in y, without reusing elements of y. The result should be a 1-to-1 mapping from indices of x to indices of y. Here's a bad way to do it that relies on sorting. It removes each element that has been paired from the list. Without sorting this would be bad, because the pairing would depend on the order of the original input arrays.

    import operator
    import numpy as np

    def min_i(values):
        # index and value of the smallest element
        min_index, min_value = min(enumerate(values), key=operator.itemgetter(1))
        return min_index, min_value

    # unsorted elements
    unsorted_x = np.random.randn(10) * 10
    unsorted_y = np.random.randn(10) * 10

    # sort arrays
    x = np.sort(unsorted_x)
    y = np.sort(unsorted_y)

    pairs = []
    indx_to_search = list(range(len(y)))

    for x_indx, x_item in enumerate(x):
        if len(indx_to_search) == 0:
            print("ran out of items to match...")
            break
        # take the smallest remaining candidate (both arrays are sorted)
        possible_values = y[indx_to_search]
        nearest_indx, nearest_item = min_i(possible_values)
        orig_indx = indx_to_search[nearest_indx]
        # remove it from the pool
        indx_to_search.remove(orig_indx)
        pairs.append((x_indx, orig_indx))

    print("paired items: ")
    for k, v in pairs:
        print(x[k], " paired with ", y[v])

I'd rather do this without sorting the elements first, but if they are sorted then I want to get the indices into the original, unsorted arrays unsorted_x and unsorted_y (see the sketch below). What's the best way to do this in numpy/scipy/Python, or with pandas? Thanks.
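For the index-recovery part, np.argsort already provides the mapping between sorted and unsorted positions; a minimal sketch:

    import numpy as np

    unsorted_x = np.random.randn(10) * 10
    order = np.argsort(unsorted_x)  # order[i] = position in unsorted_x of the i-th smallest value
    x = unsorted_x[order]           # sorted copy

    # an index i into the sorted array maps back to the original via order[i]
    assert x[3] == unsorted_x[order[3]]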

Edit: to clarify, I'm not trying to find the best fit across all elements (e.g. minimizing the sum of distances), but rather the best fit for each element, and it's okay if that sometimes comes at the expense of other elements. I assume that y is generally much larger than x, contrary to the example above, so there will usually be many good fits for each value of x in y, and I just want to find one efficiently.

Can anyone show an example of using scipy's kd-trees for this? The docs are quite sparse:

    kdtree = scipy.spatial.cKDTree([x, y])
    kdtree.query([-3] * 10)  # ?? unsure about what query takes as arg
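For reference, cKDTree expects its data as an (n_points, n_dims) array, so 1-D values have to be reshaped into a single column, and query returns a (distances, indices) pair. A minimal sketch of the calling convention, with illustrative data (the first answer below builds on this same pattern):

    import numpy as np
    from scipy.spatial import cKDTree

    y = np.random.randn(10) * 10
    tree = cKDTree(y[:, None])               # 1-D data as an (n, 1) column

    x = np.random.randn(10) * 10
    dist, idx = tree.query(x[:, None], k=1)  # nearest y index for each x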
2 answers

EDIT 2: A solution using a KDTree can perform very well if you can choose a number of neighbors k that guarantees a unique neighbor for every element in your array. With the following code:

    import numpy as np
    import scipy.spatial

    def nearest_neighbors_kd_tree(x, y, k):
        x, y = map(np.asarray, (x, y))
        tree = scipy.spatial.cKDTree(y[:, None])
        # indices of the k nearest elements of y for each x, ordered by distance
        ordered_neighbors = tree.query(x[:, None], k)[1]
        nearest_neighbor = np.empty((len(x),), dtype=np.intp)
        nearest_neighbor.fill(-1)
        used_y = set()
        for j, neigh_j in enumerate(ordered_neighbors):
            # take the closest neighbor not yet claimed by an earlier x
            for idx in neigh_j:
                if idx not in used_y:
                    nearest_neighbor[j] = idx
                    used_y.add(idx)
                    break
        return nearest_neighbor

and a sample of n=1000 points (the x and y defined at the end of this answer), I get:

    In [9]: np.any(nearest_neighbors_kd_tree(x, y, 12) == -1)
    Out[9]: True

    In [10]: np.any(nearest_neighbors_kd_tree(x, y, 13) == -1)
    Out[10]: False
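If you would rather not probe k by hand, a hypothetical helper (not in the original answer; it assumes numpy as np and the function above are in scope) can scan upward until no element is left unmatched. Enlarging k only adds fallback candidates, so the first k that works should be the smallest usable one:

    def smallest_usable_k(x, y, k_max=None):
        # probe increasing k until nearest_neighbors_kd_tree leaves no -1 behind;
        # start at 2 because query() returns scalars for k=1, which the
        # function above does not handle
        if k_max is None:
            k_max = len(y)
        for k in range(2, k_max + 1):
            if not np.any(nearest_neighbors_kd_tree(x, y, k) == -1):
                return k
        raise ValueError("no k up to k_max gives every x a unique neighbor")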

So k=13 is optimal, and the timing is then:

    In [11]: %timeit nearest_neighbors_kd_tree(x, y, 13)
    100 loops, best of 3: 9.26 ms per loop

But in the worst case you may need k=1000, and then:

    In [12]: %timeit nearest_neighbors_kd_tree(x, y, 1000)
    1 loops, best of 3: 424 ms per loop

Which is slower than the other options:

    In [13]: %timeit nearest_neighbors(x, y)
    10 loops, best of 3: 60 ms per loop

    In [14]: %timeit nearest_neighbors_sorted(x, y)
    10 loops, best of 3: 47.4 ms per loop

EDIT: Sorting the array before searching pays off for arrays of more than 1000 elements:

    import numpy as np

    def nearest_neighbors_sorted(x, y):
        x, y = map(np.asarray, (x, y))
        y_idx = np.argsort(y)
        y = y[y_idx]  # sorted copy; y_idx maps back to the original positions
        nearest_neighbor = np.empty((len(x),), dtype=np.intp)
        for j, xj in enumerate(x):
            idx = np.searchsorted(y, xj)
            # searchsorted gives the insertion point; the nearest value is
            # either y[idx] or y[idx - 1], whichever is closer
            if idx == len(y) or idx != 0 and y[idx] - xj > xj - y[idx - 1]:
                idx -= 1
            nearest_neighbor[j] = y_idx[idx]
            # remove the claimed element so it cannot be reused
            y = np.delete(y, idx)
            y_idx = np.delete(y_idx, idx)
        return nearest_neighbor

With arrays 10,000 elements long:

    In [2]: %timeit nearest_neighbors_sorted(x, y)
    1 loops, best of 3: 557 ms per loop

    In [3]: %timeit nearest_neighbors(x, y)
    1 loops, best of 3: 1.53 s per loop

For smaller arrays, it is slightly worse.


You are going to have to iterate over all your items to implement a greedy nearest-neighbor algorithm like this, if only to discard duplicates. With that in mind, this is the fastest I could come up with:

    import numpy as np

    def nearest_neighbors(x, y):
        x, y = map(np.asarray, (x, y))
        y = y.copy()
        y_idx = np.arange(len(y))
        nearest_neighbor = np.empty((len(x),), dtype=np.intp)
        for j, xj in enumerate(x):
            # brute-force scan for the closest remaining element of y
            idx = np.argmin(np.abs(y - xj))
            nearest_neighbor[j] = y_idx[idx]
            # remove the claimed element so it cannot be reused
            y = np.delete(y, idx)
            y_idx = np.delete(y_idx, idx)
        return nearest_neighbor

And now with:

    n = 1000
    x = np.random.rand(n)
    y = np.random.rand(2*n)

I get:

    In [11]: %timeit nearest_neighbors(x, y)
    10 loops, best of 3: 52.4 ms per loop
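To connect this back to the 1-to-1 index mapping asked for in the question: entry j of the returned array is the index into y claimed by x[j]. A small usage sketch, assuming the names from the code above:

    nn = nearest_neighbors(x, y)
    for j, i in enumerate(nn):
        print(x[j], "paired with", y[i])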

This simplified code worked fine.

    import numpy as np

    N = 12
    M = 15
    X = [np.random.random() for i in range(N)]
    Y = [np.random.random() for i in range(M)]

    pair = []
    for x in X:
        # distance from x to every remaining element of Y
        t = [abs(x - y) for y in Y]
        ind = t.index(min(t))
        pair.append((x, Y[ind]))
        # remove the matched element of Y so it cannot be reused
        # (do not remove from X while iterating over it)
        Y.remove(Y[ind])
    print(pair)
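Note that, unlike the numpy answers above, this pairs values rather than indices, and it scans all of the remaining Y for every x, so it is O(N*M): fine for lists this small, but it will not scale like the searchsorted or KDTree approaches.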
