Unsupervised clustering with an unknown number of clusters

I have a large set of vectors in three dimensions. I need to cluster them based on Euclidean distance, such that all the vectors in any particular cluster have a Euclidean distance between each other less than a threshold "T".

I do not know how many clusters exist. In the end, there may be individual vectors that are not part of any cluster, because their Euclidean distance to every other vector in the space is not less than "T".

What existing algorithms / approaches should be used here?

Thanks, Abhishek S

math algorithm artificial-intelligence machine-learning cluster-analysis
3 answers

You can use hierarchical clustering. It is a fairly simple approach, so many implementations are available. It is, for example, included in Python's scipy.

See for example the following script:

    import matplotlib.pyplot as plt
    import numpy
    import scipy.cluster.hierarchy as hcluster

    # generate 3 clusters of around 100 points each, plus one orphan point
    N = 100
    data = numpy.random.randn(3 * N, 2)
    data[:N] += 5
    data[-N:] += 10
    data[-1:] -= 20

    # clustering
    thresh = 1.5
    clusters = hcluster.fclusterdata(data, thresh, criterion="distance")

    # plotting
    plt.scatter(*numpy.transpose(data), c=clusters)
    plt.axis("equal")
    title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
    plt.title(title)
    plt.show()

This gives a result similar to the following image.

(image: scatter plot of the resulting clusters, colored by cluster label)

The threshold passed as a parameter is the distance value used to decide whether points/clusters will be merged into another cluster. The distance metric to be used can also be specified.

Note that there are various methods for computing the intra-/inter-cluster similarity, e.g. the distance between the closest points, the distance between the farthest points, the distance to the cluster centers, and so on. Some of these methods are also supported by scipy's hierarchical clustering module (single/complete/average... linkage). Based on your post, I think you would want to use complete linkage.
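Since the question requires that all vectors within a cluster be within distance "T" of each other, complete linkage combined with the distance criterion matches that requirement directly: complete linkage measures the distance between two clusters by their farthest members, so every pair of points inside a resulting cluster is closer than the threshold. A minimal sketch, using the parameters of scipy's fclusterdata:

    clusters = hcluster.fclusterdata(
        data, thresh,
        criterion="distance",  # cut the dendrogram at a fixed distance
        metric="euclidean",    # pairwise distance measure between points
        method="complete",     # complete linkage (farthest-point distance)
    )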

Please note that this approach also allows for small (single-point) clusters if they do not meet the similarity criterion of the other clusters, i.e. the distance threshold.

There are other algorithms that may scale better, which becomes relevant in situations with large numbers of data points. As other answers/comments suggest, you might also want to take a look at the DBSCAN algorithm:

https://en.wikipedia.org/wiki/DBSCAN

For a good overview of these and other clustering algorithms, also check out this demo page (from the scikit-learn Python library):

Image copied from this location:

http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html

As you can see, each algorithm makes certain assumptions about the number and shape of the clusters, which you need to take into account, whether they are implicit assumptions imposed by the algorithm or explicit assumptions specified through its parameterization.


moooeeeep's answer recommends hierarchical clustering. I wanted to elaborate on how to choose the clustering threshold.

One way is to compute clusterings based on different threshold values t1, t2, t3, ... and then compute a metric for the "quality" of each clustering. The premise is that the clustering with the optimal number of clusters will have the maximum value of the quality metric.

An example of a quality metric that I have used in the past is Calinski-Harabasz. Briefly: you compute the average inter-cluster distances and divide them by the intra-cluster distances. The optimal clustering assignment will have the clusters that are most separated from each other and the clusters that are the "densest".
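A sketch of this procedure, assuming scikit-learn is available for the score (the candidate threshold values are my own placeholders, not part of the answer):

    import scipy.cluster.hierarchy as hcluster
    from sklearn.metrics import calinski_harabasz_score

    def best_clustering(data, thresholds):
        # score each candidate threshold and keep the best-scoring clustering
        best = None
        for t in thresholds:
            labels = hcluster.fclusterdata(data, t, criterion="distance")
            if len(set(labels)) < 2:
                continue  # the score is undefined for a single cluster
            score = calinski_harabasz_score(data, labels)
            if best is None or score > best[0]:
                best = (score, t, labels)
        return best  # (score, threshold, labels), or None

    score, t, labels = best_clustering(data, [0.5, 1.0, 1.5, 2.0, 2.5])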

By the way, you do not have to use hierarchical clustering. You can also use something like k-means: run it for each k, and then pick the k that has the highest Calinski-Harabasz score.
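The k-means variant of the same idea would look roughly like this (the range of k values tried is an assumption):

    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    scores = {}
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(data)
        scores[k] = calinski_harabasz_score(data, labels)
    best_k = max(scores, key=scores.get)  # k with the highest score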

Let me know if you need more links and I will comb my hard drive for some documents.


Check out the DBSCAN algorithm. It clusters based on the local density of vectors, i.e. the vectors must not be more than some ε-distance apart, and it can determine the number of clusters automatically. It also considers outliers, i.e. points with an insufficient number of ε-neighbors, which are not made part of any cluster. The Wikipedia page links to several implementations.
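A minimal sketch using scikit-learn's DBSCAN implementation (the eps and min_samples values are placeholders to tune for your data; note also that two points in the same DBSCAN cluster can end up farther apart than eps, because clusters grow by density-chaining):

    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=1.0, min_samples=5).fit(data)
    labels = db.labels_  # label -1 marks outliers (noise points)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)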



