K-means and empty clusters

I am trying to implement k-means as homework. My exercise sheet gives me the following remark about empty centers:

During iterations, if any of the cluster centers does not have data points associated with it, replace it with a random data point.

This bothers me a little. First, Wikipedia and the other sources I read do not mention it at all. Second, I read about the problem of "choosing a good k for your data" - how is my algorithm supposed to converge if I keep assigning new centers to clusters that become empty?

If I simply ignore empty clusters, the algorithm converges after 30-40 iterations. Is it wrong to ignore them?

+8
k-means
4 answers

See this example of handling empty clusters: http://www.ceng.metu.edu.tr/~tcan/ceng465_f1314/Schedule/KMeansEmpty.html An empty cluster basically means that either 1) the random initialization was unlucky, or 2) the chosen number of clusters k is wrong. You should try several different values for k and pick the best one. If you encounter an empty cluster during an iteration, place a random data point into that cluster and continue. Hope this still helps with your homework, even a year later.
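
For concreteness, here is a rough sketch of how that rule can be wired into a basic k-means loop. It is only an illustration assuming NumPy and Euclidean distance; the function name kmeans and its defaults are my own, not part of the exercise sheet.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    """Plain Lloyd's k-means; empty clusters are re-seeded with a random
    data point, following the rule from the exercise sheet."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # initialize the centers with k distinct random data points
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()

    for _ in range(max_iter):
        # assignment step: nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # update step
        new_centers = np.empty_like(centers)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                # empty cluster: replace its center with a random data point
                new_centers[j] = X[rng.integers(len(X))]
            else:
                new_centers[j] = members.mean(axis=0)

        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    return centers, labels
```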

+4

Handling empty clusters is not part of the basic k-means algorithm, but it can lead to better cluster quality. As for convergence, it is never exact, only heuristically guaranteed, so in practice the convergence criterion is usually extended to include a maximum number of iterations.

As for the strategy for handling this problem, I would say that assigning a random data point is not very smart, since it can hurt cluster quality: the randomly chosen point may lie very close to, or very far from, the center it is currently assigned to. A better heuristic for this case is to take the point that is farthest from the center of the largest cluster, move it into the empty cluster, and repeat until there are no empty clusters left (a sketch follows below).
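
A possible sketch of that heuristic, assuming the data, the current centers, and the current labels are NumPy arrays; the helper name reseed_empty_clusters is my own, and it would be called between the assignment and the update step whenever a cluster comes up empty.

```python
import numpy as np

def reseed_empty_clusters(X, centers, labels):
    """For every empty cluster, take the point of the largest cluster that
    is farthest from that cluster's center and make it the new center."""
    k = len(centers)
    counts = np.bincount(labels, minlength=k)
    for j in np.where(counts == 0)[0]:
        largest = counts.argmax()                  # currently biggest cluster
        members = np.where(labels == largest)[0]
        # farthest member of the largest cluster from its own center
        d = np.linalg.norm(X[members] - centers[largest], axis=1)
        far_idx = members[d.argmax()]
        centers[j] = X[far_idx]                    # becomes the empty cluster's center
        labels[far_idx] = j                        # and is reassigned to it
        counts[largest] -= 1
        counts[j] += 1
    return centers, labels
```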

+2

You should not ignore empty clusters; replace them instead. k-means is an algorithm that can only find local minima, and an empty cluster is exactly the kind of local minimum you do not want. Your program will still converge even if you replace an empty cluster's center with a random data point. Remember that at the beginning of the algorithm you pick the initial k centers at random; if it can converge from k random points, why would it not converge from k-1 existing centers plus one new random point? It just needs a few more iterations.

+1

"Choosing a good k for your data" refers to the problem of picking the right number of clusters. Since k-means works with a fixed number of cluster centers, that number has to be chosen up front. Choosing it badly can make it hard to split the data points into clusters, or the clusters can end up small and meaningless.
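
One crude but common way to compare candidate values of k is to compute the within-cluster sum of squares for each and look for an "elbow" where adding more clusters stops helping. A small sketch, using scikit-learn's KMeans, which exposes that quantity as inertia_; the library choice and the toy data are my own, not implied by the question.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # toy data stand-in

# within-cluster sum of squares for a range of k; look for the "elbow"
# where adding another cluster stops reducing the error noticeably
for k in range(1, 11):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 2))
```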

I cannot tell you whether to ignore empty clusters. If you do, you may end up with fewer clusters than you specified at the beginning. That will confuse people who expect k-means to behave in a certain way, but it is not necessarily a bad idea.

If you re-seed empty cluster centers, your algorithm will probably still converge, provided this happens only a limited number of times. However, if you have to re-seed too often, it can happen that your algorithm never terminates.

+1
