This may be a bit of a retread, since others have already linked to the Wikipedia article on determining the number of clusters in a data set, but I found that article too dense, so here is a short, intuitive answer:
In principle, there is no universally "correct" answer for the number of clusters in a data set. Increasing the number of clusters always reduces the within-cluster variance (at the cost of a longer description length), and in any non-trivial data set the variance will never vanish completely unless you assign a separate Gaussian to every single point — which makes the clustering useless. This is an instance of a more general phenomenon known as the "futility of bias-free learning": a learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.
Thus, you basically need to pick some function of your data set to optimize in order to choose the number of clusters (see the Wikipedia article on inductive bias for some examples).
In other bad news: in all such cases, finding the optimal number of clusters is known to be NP-hard, so the best you can hope for is a good heuristic.
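To make the variance point concrete, here is a minimal sketch in plain NumPy (the toy k-means, the synthetic blob data, and all names are my own illustration, not a standard library API): it runs Lloyd's algorithm for k = 1..6 on three well-separated blobs and prints the within-cluster sum of squares. The fit measure keeps shrinking as k grows, so by itself it cannot tell you where to stop; in practice you would look for an "elbow" where the improvement levels off, or penalize model complexity (BIC, MDL, gap statistic, etc.).

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Tiny Lloyd's-algorithm k-means; fixed seed so the run is reproducible."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        labels = ((X[:, None, :] - centers) ** 2).sum(axis=2).argmin(axis=1)
        # move each center to the mean of its points (keep it if its cluster is empty)
        centers = np.stack([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels, centers

def within_ss(X, labels, centers):
    """Total squared distance of points to their assigned cluster centers."""
    return float(((X - centers[labels]) ** 2).sum())

rng = np.random.default_rng(42)
# synthetic data: three well-separated 2-D blobs of 50 points each
X = np.concatenate([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

# within-cluster sum of squares for k = 1..6: it only goes down as k grows
wss = {k: within_ss(X, *kmeans(X, k)) for k in range(1, 7)}
for k, v in wss.items():
    print(f"k={k}: within-cluster SS = {v:.1f}")
```

Note that nothing in the printed numbers alone says "stop at 3"; it is the prior assumption that a sharp drop followed by a plateau marks the "real" number of clusters — an inductive bias — that turns this into a usable heuristic.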