How to specify a distance function for clustering?

I would like to cluster points using a custom distance and, oddly enough, it seems that neither the scipy nor the sklearn clustering methods allow specifying a distance function.

For example, in sklearn.cluster.AgglomerativeClustering, the only thing I can do is pass an affinity matrix (which will be very memory-intensive). To build this matrix, it is recommended to use sklearn.neighbors.kneighbors_graph, but I don't understand how to specify the distance function between two points. Can anyone enlighten me?

+7
python scipy scikit-learn hierarchical-clustering
3 answers

All scipy hierarchical clustering routines will accept a custom distance function that takes two 1D vectors specifying a pair of points and returns a scalar. For example, using fclusterdata:

    import numpy as np
    from scipy.cluster.hierarchy import fclusterdata

    # a custom function that just computes Euclidean distance
    def mydist(p1, p2):
        diff = p1 - p2
        return np.vdot(diff, diff) ** 0.5

    X = np.random.randn(100, 2)

    fclust1 = fclusterdata(X, 1.0, metric=mydist)
    fclust2 = fclusterdata(X, 1.0, metric='euclidean')

    print(np.allclose(fclust1, fclust2))
    # True

Valid values for the metric= kwarg are the same as for scipy.spatial.distance.pdist.
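
Because the metric goes through pdist, the same idea can be sketched with a distance that is not Euclidean at all. The example below is only an illustration under stated assumptions: weighted_manhattan and WEIGHTS are hypothetical names invented here, and the data is two synthetic blobs. It computes the condensed distances with pdist, then clusters them with linkage and fcluster.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical per-axis weights for the illustrative metric below
WEIGHTS = np.array([1.0, 3.0])


def weighted_manhattan(p1, p2):
    """Hypothetical custom metric: L1 distance with per-axis weights."""
    return float(np.sum(WEIGHTS * np.abs(p1 - p2)))


rng = np.random.default_rng(0)
# Two well-separated synthetic blobs of 20 points each
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])

# pdist calls the function on every pair of rows and returns
# a condensed array of length n * (n - 1) / 2
condensed = pdist(X, metric=weighted_manhattan)

# Average linkage only needs pairwise distances
Z = linkage(condensed, method="average")

# Cut the tree so that at most two flat clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
```

A Python callable is evaluated once per pair, so this is likely to be much slower than a built-in metric name on large inputs.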

+9

For hierarchical clustering, scipy.cluster.hierarchy.fclusterdata allows you to use any of the distance metrics included in the list here via the metric= keyword argument, provided that it works with the linkage method you want.
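
As a rough illustration of pairing a metric with a linkage method, the sketch below passes a built-in metric name together with method= to fclusterdata. The data and parameter values are assumptions chosen for the example. SciPy's linkage documentation notes that the centroid, median and ward methods are only correctly defined for Euclidean distances, so a non-Euclidean metric is paired here with complete linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

rng = np.random.default_rng(1)
# Two synthetic blobs of 15 points each in three dimensions
X = np.vstack([rng.normal(0.0, 0.1, size=(15, 3)),
               rng.normal(4.0, 0.1, size=(15, 3))])

# City-block (L1) distances combined with complete linkage;
# criterion="maxclust" asks for at most t flat clusters
labels = fclusterdata(X, t=2, criterion="maxclust",
                      metric="cityblock", method="complete")
```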

+1

sklearn has DBSCAN, which accepts a precomputed distance matrix when you pass metric='precomputed' (a square n-by-n matrix, where M_ij is the distance between points i and j). But this may not be the type of clustering you are looking for.
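
A minimal sketch of the precomputed route, assuming a hypothetical metric (max_axis_diff) and synthetic data: the pairwise distances are computed with pdist, expanded to a square matrix with squareform, and handed to DBSCAN with metric="precomputed". The eps and min_samples values are arbitrary choices for this toy data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN


def max_axis_diff(p1, p2):
    """Hypothetical custom metric: largest per-axis absolute difference."""
    return float(np.max(np.abs(p1 - p2)))


rng = np.random.default_rng(2)
# Two tight synthetic blobs of 25 points each
X = np.vstack([rng.normal(0.0, 0.05, size=(25, 2)),
               rng.normal(3.0, 0.05, size=(25, 2))])

# Square symmetric matrix: D[i, j] is the distance between points i and j
D = squareform(pdist(X, metric=max_axis_diff))

# With metric="precomputed", the input is read as distances, not features
labels = DBSCAN(eps=0.5, min_samples=5, metric="precomputed").fit_predict(D)
```

The square matrix needs memory proportional to n squared, which is the cost the question is worried about.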

In addition, the scipy.cluster.hierarchy routines can work from precomputed distances: linkage accepts a condensed distance array, whereas fclusterdata itself expects the raw observations. A code snippet from another answer converts the NxN distance matrix to the condensed format that linkage can read easily:

    import scipy.spatial.distance as ssd

    # distMatrix is assumed to be your precomputed, symmetric n*n matrix
    # convert the redundant n*n square matrix form into a condensed nC2 array
    distArray = ssd.squareform(distMatrix)
    # distArray[{n choose 2} - {n-i choose 2} + (j-i-1)] is the distance
    # between points i and j (for i < j)
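
A minimal sketch of how the condensed array might then be clustered, under stated assumptions: the square matrix here is a stand-in built from synthetic points, and the threshold t=1.0 is arbitrary. It also checks the condensed index formula for one pair.

```python
from math import comb

import numpy as np
import scipy.spatial.distance as ssd
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(3)
points = np.vstack([rng.normal(0.0, 0.1, size=(10, 2)),
                    rng.normal(5.0, 0.1, size=(10, 2))])

# Stand-in for a precomputed square distance matrix (n x n, symmetric)
distMatrix = ssd.squareform(ssd.pdist(points))
n = distMatrix.shape[0]

# Square form -> condensed form of length n * (n - 1) / 2
distArray = ssd.squareform(distMatrix)

# Check the condensed index formula for one pair with i < j
i, j = 2, 7
idx = comb(n, 2) - comb(n - i, 2) + (j - i - 1)
pair_matches = bool(np.isclose(distArray[idx], distMatrix[i, j]))

# linkage accepts the condensed array directly
Z = linkage(distArray, method="average")

# Cut the tree where the merge distance exceeds 1.0
labels = fcluster(Z, t=1.0, criterion="distance")
```
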
+1
