Optimize your nearest neighbor search

I am trying to find all the nearest neighbors within a radius of 1 KM. Here is my script for building a tree and finding nearby points,

from pysal.cg.kdtree import KDTree

def construct_tree(s):
    data_geopoints = [tuple(x) for x in s[['longitude','latitude']].to_records(index=False)]
    tree = KDTree(data_geopoints, distance_metric='Arc', radius=pysal.cg.RADIUS_EARTH_KM)
    return tree

def get_neighbors(s,tree):
    indices = tree.query_ball_point(s, 1)
    return indices

#Constructing the tree for search
tree = construct_tree(data)

#Finding the nearest neighbours within 1KM
data['neighborhood'] = data['lat_long'].apply(lambda row: get_neighbors(row,tree))

From what I read on the pysal page, it says:

kd-tree built on top of the kd-tree functionality in scipy. If you use scipy 0.12 or more, use scipy.spatial.cKDTree, otherwise use scipy.spatial.KDTree.

In my case, it should use cKDTree. This works fine for a sample dataset, but since it tree.query_ball_pointreturns a list of indexes as a result. Each list will contain 100 items. For my data (2 million records), it gets bigger and bigger and stops due to a memory problem after a certain point. Any idea on how to solve this?

+6
1

, - , , (tree.query_ball_point ) , , . .

0

All Articles