MATLAB kmeans does not always converge to global minima

I wrote the k-means clustering algorithm in MATLAB, and I thought I would try it against MATLAB's built-in kmeans(X,k).

However, for a very simple setup of four clusters (see the figure), the built-in kmeans does not always converge to the optimal solution (left), but sometimes to a suboptimal one (right).

The one I wrote does not always find it either, but shouldn't the built-in function be able to solve such an easy problem and always find the optimal solution?

[figure: the four-cluster data set, clustered optimally (left) and stuck in a suboptimal solution (right)]

+7
matlab machine-learning cluster-analysis k-means
5 answers

As @Alexandre C. explained, the k-means algorithm depends on the initial positions of the cluster centroids, and there is no guarantee that it will converge to the optimal solution.

The best you can do is repeat the experiment several times with random starting points.

The MATLAB implementation offers the Replicates option, which repeats the clustering N times and selects the solution with the lowest total within-cluster point-to-centroid distance. You can also control how the initial centroids are chosen using the Start option.

In addition, MATLAB provides a choice among a range of distance measures (Euclidean, Manhattan, cosine, ...). An optional EmptyAction parameter lets you control what happens when a cluster loses all of its assigned members during the iterations.

But the real advantage is that it uses a two-phase algorithm: the usual batch iterations of assignment and centroid updates, followed by an online phase in which points are reassigned one at a time. Read the Algorithms section of the documentation page for more information.
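A minimal sketch of how these options fit together (untested; the data set and parameter values are made up for illustration):

    % Four synthetic 2-D clusters, 50 points each
    rng(0);                                   % reproducible random numbers
    mu = [0 0; 0 5; 5 0; 5 5];                % well-separated cluster centers
    X = [];
    for i = 1:4
        X = [X; randn(50, 2) + repmat(mu(i, :), 50, 1)];
    end

    [idx, C, sumd] = kmeans(X, 4, ...
        'Replicates',  10, ...                % run 10 times, keep the best
        'Start',       'sample', ...          % initial centroids: random rows of X
        'Distance',    'sqeuclidean', ...     % squared Euclidean (the default)
        'EmptyAction', 'singleton', ...       % re-seed a cluster that goes empty
        'OnlinePhase', 'on');                 % run the online update phase too
    fprintf('Total within-cluster distance: %g\n', sum(sumd));

With Replicates set to 10, kmeans returns the solution with the smallest sum(sumd) across the ten runs.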

+11

The k-means algorithm is quite sensitive to the initial guess for the cluster centers. Have you tried running both codes with the same starting centers?

The algorithm is simple, and I doubt there are many differences between your implementation and MATLAB's.
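For example (a sketch; my_kmeans is a placeholder for your own implementation), you can pass an explicit matrix of starting centers to the built-in function and feed the same matrix to your code:

    k  = 4;
    C0 = X(randperm(size(X, 1), k), :);        % pick k distinct data points
    idx_builtin = kmeans(X, k, 'Start', C0);   % built-in, fixed initial centers
    idx_mine    = my_kmeans(X, C0);            % hypothetical: your implementation

If both converge to the same partition from the same C0, the difference you saw was only in the initialization.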

+4

I would not call this an easy problem. :) In fact, the Wikipedia article on "k-means clustering" gives a rather bleak picture of the computational complexity.

If you want to be free of random restarts (and the dependence on the initial guess), a possible trade-off is the "global k-means algorithm"; the paper and MATLAB code can be found here: http://lear.inrialpes.fr/~verbeek/software.php

+3

You will probably often be disappointed with the solution from any particular run of "the k-means algorithm" (that is, Lloyd's algorithm), because Lloyd's algorithm often gets stuck at bad local minima.

Fortunately, Lloyd's algorithm is just one way of solving k-means, and there is an approach that almost always finds better local minima.

The trick is to update the cluster assignments of the data points one at a time. You can do this efficiently by keeping a count of the number of points n assigned to each mean, so that you can recompute the mean m of a cluster after removing a point x as follows:

 m_new = (n * m - x) / (n - 1) 

And add x to the mean m using:

 m_new = (n * m + x) / (n + 1) 

Of course, because this cannot be vectorized, it is a little painful to make it run fast in MATLAB, but it is not so bad in other languages.
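For what it's worth, here is a rough MATLAB sketch of the scheme anyway (untested; it scores candidate moves with Hartigan-style costs n/(n+1) * ||x - m||^2, so every accepted move strictly lowers the total within-cluster distance and the loop terminates):

    function [assign, M] = onepoint_kmeans(X, k)
    % One-point-at-a-time k-means using the incremental mean updates above.
    % Save as onepoint_kmeans.m; X is N-by-d, k is the number of clusters.
    N = size(X, 1);
    assign = [(1:k)'; randi(k, N - k, 1)];   % random init, no cluster empty
    n = zeros(k, 1);
    M = zeros(k, size(X, 2));
    for j = 1:k
        n(j) = sum(assign == j);
        M(j, :) = mean(X(assign == j, :), 1);
    end
    changed = true;
    while changed
        changed = false;
        for i = 1:N
            j = assign(i);
            if n(j) == 1, continue; end      % never empty a cluster
            x = X(i, :);
            % remove x from its current cluster's mean
            M(j, :) = (n(j) * M(j, :) - x) / (n(j) - 1);
            n(j) = n(j) - 1;
            % cost of adding x to each cluster, weighted by n/(n+1)
            cost = (n ./ (n + 1)) .* sum((M - repmat(x, k, 1)).^2, 2);
            [~, jbest] = min(cost);
            % add x to the winning cluster (possibly the one it just left)
            M(jbest, :) = (n(jbest) * M(jbest, :) + x) / (n(jbest) + 1);
            n(jbest) = n(jbest) + 1;
            assign(i) = jbest;
            if jbest ~= j, changed = true; end
        end
    end
    end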

If you are really interested in getting the best possible local minima, and you don't mind using exemplar-based clustering, you should have a look at affinity propagation. MATLAB implementations are available on the Frey lab affinity propagation page.

+2

Although k-means++ will not solve the problem in a single run, it tends to give better results when it is run N times, compared to running the original k-means algorithm N times.
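The seeding step itself is short enough to sketch in MATLAB (untested; recent MATLAB releases also provide k-means++ seeding natively via kmeans(X, k, 'Start', 'plus')):

    function C = kmeanspp_seed(X, k)
    % k-means++ seeding: each new center is sampled with probability
    % proportional to its squared distance from the nearest chosen center.
    N = size(X, 1);
    C = zeros(k, size(X, 2));
    C(1, :) = X(randi(N), :);             % first center: uniform at random
    D = inf(N, 1);                        % squared distance to nearest center
    for j = 2:k
        D = min(D, sum((X - repmat(C(j - 1, :), N, 1)).^2, 2));
        p = cumsum(D / sum(D));           % cumulative sampling distribution
        C(j, :) = X(find(rand <= p, 1), :);
    end
    end

The returned C can then be handed to any k-means implementation as its starting centers, e.g. kmeans(X, k, 'Start', kmeanspp_seed(X, k)).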

+2
