Group Discovery in Datasets

Question

Suppose you have a set of data points, for example the one plotted here (this plot is not specific to my problem; it is simply a suitable example):

Looking at the scatter plot, it is fairly obvious that the data points form two “groups”, with some random points that clearly do not belong to either.

I am looking for an algorithm that would allow me to:

start with a data set of two or more dimensions
detect such groups in the data set without knowing in advance how many there are (or whether there are any at all)
once the groups have been found, “ask” the model of the groups whether a new sample point seems to fit any of them

+7

algorithm statistics probability feature-detection

Sami Jan 12 '10 at 20:58

3 answers

I think you are looking for something like the k-means clustering algorithm.

You should be able to find adequate implementations in most general purpose languages.
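For illustration, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm), not a tuned implementation. The sample points, the `nearest_cluster` helper and its `max_distance` cutoff are assumptions added for this example: k-means itself assigns every point to some cluster, so a distance threshold is one simple way to "ask" whether a new point fits any group.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k of the data points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

def nearest_cluster(point, centroids, max_distance):
    """Index of the closest centroid, or None if it is farther than max_distance."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best if distances[best] <= max_distance else None

# Made-up sample data: two tight groups of four points each.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (1.1, 1.2),
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.0), (5.1, 5.2)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(nearest_cluster((1.0, 1.0), centroids, max_distance=1.0))  # a cluster index
print(nearest_cluster((9.0, 0.5), centroids, max_distance=1.0))  # None: fits no group
```

Note that the number of clusters `k` still has to be supplied up front, which is the limitation the question wants to avoid.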

+3

ConsultUtah Jan 12 '10 at 21:01

You need one of the clustering algorithms. They all fall into two groups:

algorithms where you specify the number of groups (clusters) yourself: 2 clusters in your example
algorithms that try to guess the correct number of clusters on their own

If you need a type 1 algorithm, then K-Means is what you want.

If you need a type 2 algorithm, you will probably want one of the hierarchical clustering algorithms. I have never implemented them myself, but I can see a simple way to modify K-Means so that the number of clusters does not have to be specified.
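As a rough sketch of the hierarchical (type 2) idea, the following pure-Python code performs single-linkage agglomerative clustering and stops merging once the two closest clusters are farther apart than a cutoff. This is one possible illustration, not the K-Means modification mentioned above; the `threshold` value and the sample points are assumptions chosen for the example. The number of clusters falls out of the cutoff instead of being given in advance.

```python
import math

def single_linkage(points, threshold):
    """Merge the two closest clusters until the closest pair is
    farther apart than `threshold`; returns a list of clusters."""
    clusters = [[p] for p in points]  # start with every point in its own cluster

    def gap(a, b):
        # Single linkage: distance between the closest members of two clusters.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        distance, i, j = min(
            (gap(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if distance > threshold:
            break  # remaining clusters are all well separated
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

# Made-up sample data: two tight groups plus one stray point.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (1.1, 1.2),
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.0), (5.1, 5.2),
          (9.0, 0.5)]
clusters = single_linkage(points, threshold=1.0)
groups = [c for c in clusters if len(c) > 1]   # real groups
noise = [c[0] for c in clusters if len(c) == 1]  # singletons treated as noise
print(len(groups), noise)
```

Treating singleton clusters as noise is a convention adopted here for the stray points in the question. The pairwise search is cubic in the number of points, so this form only suits small data sets.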

+2

Roman Jan 12 '10 at 22:39

Tristan · Accepted Answer · 2010-01-12T22:14:13+0000

There are many options, but if you are interested in the probability that a new data point belongs to a particular mixture component, I would use a probabilistic approach such as Gaussian mixture modeling, estimated either by maximum likelihood or by Bayesian methods.

Maximum likelihood estimation of mixture models is implemented in Matlab.
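To make the maximum likelihood route concrete, here is a small pure-Python sketch of EM for a Gaussian mixture. Several simplifying assumptions are mine, not the answer's: each component is isotropic (one variance shared by all axes), the number of components is fixed by the initial means passed in, and the sample points are made up. `responsibilities` gives the probability that a point belongs to each component, and `log_density` can be used to flag points that fit no component well.

```python
import math

def component_logs(x, weights, means, variances):
    """log(weight_j * N(x | mean_j, var_j * I)) for every component j."""
    d = len(x)
    return [math.log(w) - 0.5 * (d * math.log(2 * math.pi * v)
                                 + sum((a - b) ** 2 for a, b in zip(x, m)) / v)
            for w, m, v in zip(weights, means, variances)]

def responsibilities(x, weights, means, variances):
    """Posterior probability of each component given the point x."""
    logs = component_logs(x, weights, means, variances)
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def log_density(x, weights, means, variances):
    """Log of the mixture density at x; very low values suggest an outlier."""
    logs = component_logs(x, weights, means, variances)
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))

def fit_gmm(points, init_means, iterations=50, min_var=1e-6):
    """EM for an isotropic Gaussian mixture; returns (weights, means, variances)."""
    k, n, d = len(init_means), len(points), len(points[0])
    weights, variances = [1.0 / k] * k, [1.0] * k
    means = [tuple(m) for m in init_means]
    for _ in range(iterations):
        # E-step: soft-assign every point to every component.
        resp = [responsibilities(p, weights, means, variances) for p in points]
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)  # guard against empty components
            weights[j] = nj / n
            means[j] = tuple(sum(r[j] * p[i] for r, p in zip(resp, points)) / nj
                             for i in range(d))
            spread = sum(r[j] * sum((a - b) ** 2 for a, b in zip(p, means[j]))
                         for r, p in zip(resp, points))
            variances[j] = max(spread / (nj * d), min_var)
    return weights, means, variances

# Made-up sample data: two tight groups of four points each.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (1.1, 1.2),
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.0), (5.1, 5.2)]
weights, means, variances = fit_gmm(points, init_means=[points[0], points[-1]])
print(responsibilities((1.0, 1.0), weights, means, variances))
print(log_density((1.0, 1.0), weights, means, variances))
print(log_density((9.0, 0.5), weights, means, variances))  # far lower: likely an outlier
```

Responsibilities always sum to one, so a far-away point still gets assigned to its nearest component; comparing `log_density` against a cutoff of your choosing is one way to reject such points.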

Your requirement that the number of components be unknown makes your model more complex. The dominant probabilistic approach is to place a Dirichlet process (DP) prior on the mixture distribution and estimate it in a Bayesian manner. For example, see this paper on infinite Gaussian mixture models. The DP mixture model gives you inference over the number of components and over which component each element belongs to, which is exactly what you want. Alternatively, you could perform model selection on the number of components, but this is generally less elegant.

There are many implementations of DP mixture models, but they may not be as convenient. For example, here is a Matlab implementation.

Your plot suggests that you are an R user. In that case, if you are looking for ready-made solutions, the answer to your question lies in this Task View for cluster analysis.
