Clustering using custom distance metrics for lat / long pairs

Question

I am trying to specify a custom distance function for scikit-learn's DBSCAN implementation:

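    from geopy.distance import vincenty
    import numpy as np
    from sklearn.cluster import DBSCAN

    def geodistance(latLngA, latLngB):
        # Print the arguments to see what DBSCAN passes in
        print(latLngA, latLngB)
        return vincenty(latLngA, latLngB).miles

    cluster_labels = DBSCAN(
        eps=500,
        min_samples=max(2, len(found_geopoints) // 10),
        metric=geodistance
    ).fit(np.array(found_geopoints)).labels_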

However, when I print out the arguments to my distance function, they are not at all what I would expect:

    [ 0.53084126 0.19584111 0.99640966 0.88013373 0.33753788 0.79983037 0.71716144 0.85832664 0.63559538 0.23032912]
    [ 0.53084126 0.19584111 0.99640966 0.88013373 0.33753788 0.79983037 0.71716144 0.85832664 0.63559538 0.23032912]

This is what my found_geopoints array looks like:

    [[  4.24680600e+01   1.40868060e+02]
     [ -2.97677600e+01  -6.20477000e+01]
     [  3.97550400e+01   2.90069000e+00]
     [  4.21144200e+01   1.43442500e+01]
     [  8.56111000e+00   1.24771390e+02]
     ...

So why are the arguments not pairs of latitude / longitude coordinates?

scikit-learn cluster-analysis dbscan

Nathan breit May 02 '14 at 4:11

2 answers

Nathan breit · Answer 1 · 2014-05-02T05:38:35+0000

It seems I found a workaround: I compute the distance matrix with pairwise_distances (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) and then pass it to DBSCAN(metric='precomputed').fit(distance_matrix).
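A minimal sketch of that workaround, assuming scikit-learn and NumPy are installed. The `haversine_miles` helper and the sample points are illustrative stand-ins, not the original poster's code (the question used geopy's `vincenty`); any function that takes two (lat, lon) rows and returns a number could be passed as `metric` instead.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumption)


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) rows in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([a[0], a[1], b[0], b[1]])
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(h))


# Hypothetical sample data: two nearby pairs and one isolated point.
points = np.array([
    [35.68, 139.69],
    [35.44, 139.64],
    [41.90, 12.50],
    [41.80, 12.60],
    [-33.87, 151.21],
])

# Build the full n x n matrix once; the metric is called on pairs of rows.
distance_matrix = pairwise_distances(points, metric=haversine_miles)

# DBSCAN reads the matrix entries as distances, so eps is in miles here.
labels = DBSCAN(eps=50, min_samples=2, metric='precomputed').fit(distance_matrix).labels_
print(labels)
```

The trade-off is memory: the precomputed matrix grows with the square of the number of points, so this approach suits small to medium datasets.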

eos · Answer 2 · 2016-08-02T22:52:21+0000

You can do this with scikit-learn: use the haversine metric with the ball tree algorithm and pass the coordinates in radians to DBSCAN's fit method.

This tutorial shows how to cluster spatial lat-long data with scikit-learn's DBSCAN, using the haversine metric to cluster based on great-circle distances between lat-long points:

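    import numpy as np
    import pandas as pd
    from sklearn.cluster import DBSCAN

    df = pd.read_csv('gps.csv')
    coords = df[['lat', 'lon']].to_numpy()  # as_matrix() was removed in pandas 1.0
    # eps (in radians) and ms are chosen by the caller
    db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))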

Note that the coordinates are passed to the .fit() method as radians and that the epsilon parameter must also be in radians.

If you want epsilon to be, say, 1.5 km, then the epsilon parameter in radians would be 1.5 / 6371, where 6371 km is the approximate mean radius of the Earth.
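The conversion works because the haversine formula yields the central angle between two points on a unit sphere: multiplying that angle by the Earth's radius gives a distance, and dividing a distance by the radius gives an angle. Here is a small sketch using only the standard library. The helper names are made up for illustration, and 6371 km is the mean radius mentioned above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius


def km_to_radians(km):
    """Convert a distance on the Earth's surface to a central angle."""
    return km / EARTH_RADIUS_KM


def haversine_radians(lat1, lon1, lat2, lon2):
    """Central angle in radians between two points given in degrees."""
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(h))


eps = km_to_radians(1.5)
print(eps)  # roughly 0.000235

# Two hypothetical points 0.01 degrees of latitude apart (about 1.1 km),
# which is within a 1.5 km eps.
angle = haversine_radians(42.0, 14.0, 42.01, 14.0)
print(angle * EARTH_RADIUS_KM)
print(angle <= eps)
```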
