Unsupervised clustering with an unknown number of clusters

I have a large set of vectors in three dimensions. I need to cluster them based on Euclidean distance, such that all the vectors in any particular cluster have a Euclidean distance between each other less than a threshold "T".

I do not know how many clusters exist. In the end, there may be individual vectors that are not part of any cluster, because their Euclidean distance to every other vector in the space is not less than "T".

What existing algorithms / approaches should be used here?

Thanks, Abhishek S

math algorithm artificial-intelligence machine-learning cluster-analysis
3 answers

You can use hierarchical clustering. It is a fairly simple approach, so many implementations are available. It is, for example, included in Python's scipy.

See for example the following script:

    import matplotlib.pyplot as plt
    import numpy
    import scipy.cluster.hierarchy as hcluster

    # generate 3 clusters of around 100 points each, plus one orphan point
    N = 100
    data = numpy.random.randn(3 * N, 2)
    data[:N] += 5
    data[-N:] += 10
    data[-1:] -= 20

    # clustering
    thresh = 1.5
    clusters = hcluster.fclusterdata(data, thresh, criterion="distance")

    # plotting
    plt.scatter(*numpy.transpose(data), c=clusters)
    plt.axis("equal")
    title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
    plt.title(title)
    plt.show()

This gives a result similar to the following image.

(image: scatter plot of the resulting clusters, colored by cluster label)

The threshold passed as a parameter is the distance value used to decide whether points/clusters will be merged into another cluster. The distance metric to be used can also be specified.

Note that there are various methods for computing the intra-/inter-cluster similarity, e.g. the distance between the closest points, the distance between the farthest points, the distance to the cluster centers, and so on. Some of these methods are also supported by scipy's hierarchical clustering module (single/complete/average... linkage). Based on your post, I think you would want to use complete linkage.
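Since the question requires that all vectors within a cluster be within distance "T" of each other, complete linkage combined with the distance criterion matches that requirement directly: complete linkage measures the distance between two clusters by their farthest members, so every pair of points inside a resulting cluster is closer than the threshold. A minimal sketch, using the parameters of scipy's fclusterdata:

    clusters = hcluster.fclusterdata(
        data, thresh,
        criterion="distance",  # cut the dendrogram at a fixed distance
        metric="euclidean",    # pairwise distance measure between points
        method="complete",     # complete linkage (farthest-point distance)
    )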

Please note that this approach also allows for small (single-point) clusters if they do not meet the similarity criterion of the other clusters, i.e. the distance threshold.

There are other algorithms that may scale better, which becomes relevant in situations with large numbers of data points. As other answers/comments suggest, you might also want to take a look at the DBSCAN algorithm:

https://en.wikipedia.org/wiki/DBSCAN

For a good overview of these and other clustering algorithms, also check out this demo page (from the scikit-learn Python library):

Image copied from this location:

http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html

As you can see, each algorithm makes certain assumptions about the number and shape of the clusters, which you need to take into account, whether they are implicit assumptions imposed by the algorithm or explicit assumptions specified through its parameterization.


moooeeeep's answer recommends hierarchical clustering. I wanted to elaborate on how to choose the clustering threshold.

One way is to compute clusterings based on different threshold values t1, t2, t3, ... and then compute a metric for the "quality" of each clustering. The premise is that the clustering with the optimal number of clusters will have the maximum value of the quality metric.

An example of a quality metric that I have used in the past is Calinski-Harabasz. Briefly: you compute the average inter-cluster distances and divide them by the intra-cluster distances. The optimal clustering assignment will have the clusters that are most separated from each other and the clusters that are the "densest".
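A sketch of this procedure, assuming scikit-learn is available for the score (the candidate threshold values are my own placeholders, not part of the answer):

    import scipy.cluster.hierarchy as hcluster
    from sklearn.metrics import calinski_harabasz_score

    def best_clustering(data, thresholds):
        # score each candidate threshold and keep the best-scoring clustering
        best = None
        for t in thresholds:
            labels = hcluster.fclusterdata(data, t, criterion="distance")
            if len(set(labels)) < 2:
                continue  # the score is undefined for a single cluster
            score = calinski_harabasz_score(data, labels)
            if best is None or score > best[0]:
                best = (score, t, labels)
        return best  # (score, threshold, labels), or None

    score, t, labels = best_clustering(data, [0.5, 1.0, 1.5, 2.0, 2.5])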

By the way, you do not have to use hierarchical clustering. You can also use something like k-means: run it for each k, and then pick the k that has the highest Calinski-Harabasz score.
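The k-means variant of the same idea would look roughly like this (the range of k values tried is an assumption):

    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    scores = {}
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(data)
        scores[k] = calinski_harabasz_score(data, labels)
    best_k = max(scores, key=scores.get)  # k with the highest score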

Let me know if you need more links and I will comb my hard drive for some documents.


Check out the DBSCAN algorithm. It clusters based on the local density of vectors, i.e. the vectors must not be more than some ε-distance apart, and it can determine the number of clusters automatically. It also considers outliers, i.e. points with an insufficient number of ε-neighbors, which are not made part of any cluster. The Wikipedia page links to several implementations.
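A minimal sketch using scikit-learn's DBSCAN implementation (the eps and min_samples values are placeholders to tune for your data; note also that two points in the same DBSCAN cluster can end up farther apart than eps, because clusters grow by density-chaining):

    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=1.0, min_samples=5).fit(data)
    labels = db.labels_  # label -1 marks outliers (noise points)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)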



