Fast fuse near points in numpy-2d (no loop)

I have a question similar to the question asked here: an easy way to merge several close points . I want to replace the points located close to each other with the average value of their coordinates. The proximity in the cells is indicated by the user (I'm talking about the Euclidean distance).

In my case, I have a lot of points (about 1 million). This method works, but it takes a lot of time, because it uses a double for loop.

Is there a faster way to detect and smooth close points in a numpy 2d array?


To be complete, I added an example:

points=array([[ 382.49056159, 640.1731949 ], [ 496.44669161, 655.8583119 ], [ 1255.64762859, 672.99699399], [ 1070.16520917, 688.33538171], [ 318.89390168, 718.05989421], [ 259.7106383 , 822.2 ], [ 141.52574427, 28.68594436], [ 1061.13573287, 28.7094536 ], [ 820.57417943, 84.27702407], [ 806.71416007, 108.50307828]]) 

The dot scatter plot is shown below. The red circle indicates points that are close to each other (in this case, the distance is 27.91 between the last two points in the array). Therefore, if the user will indicate a minimum distance of 30, these points should be fused.

enter image description here

At the output of the fuse function, the last points merge. It will look like this:

 #output array([[ 382.49056159, 640.1731949 ], [ 496.44669161, 655.8583119 ], [ 1255.64762859, 672.99699399], [ 1070.16520917, 688.33538171], [ 318.89390168, 718.05989421], [ 259.7106383 , 822.2 ], [ 141.52574427, 28.68594436], [ 1061.13573287, 28.7094536 ], [ 813.64416975, 96.390051175]]) 
+5
source share
2 answers

If you have a lot of points, it might be faster to create a kD tree using scipy.spatial.cKDTree , then query it for pairs of points that are closer than some threshold values:

 import numpy as np from scipy.spatial import cKDTree tree = cKDTree(points) rows_to_fuse = tree.query_pairs(r=30) print(repr(rows_to_fuse)) # {(8, 9)} print(repr(points[list(rows_to_fuse)])) # array([[ 820.57417943, 84.27702407], # [ 806.71416007, 108.50307828]]) 

The main advantage of this approach is that you do not need to calculate the distance between each pair of points in your dataset.

+5
source

You can use scipy distance functions like pdist to quickly find which points should be combined:

 import numpy as np from scipy.spatial.distance import pdist, squareform d = squareform(pdist(a)) d = np.ma.array(d, mask=np.isclose(d, 0)) a[d.min(axis=1) < 30] #array([[ 820.57417943, 84.27702407], # [ 806.71416007, 108.50307828]]) 

Note

For large samples, this method can cause memory errors because it stores a complete matrix containing relative distances.

+1
source

All Articles