K-means and empty clusters

I am trying to implement k-means as homework. My exercise sheet gives me the following remark about empty centers:

During iterations, if any of the cluster centers does not have data points associated with it, replace it with a random data point.

This bothers me a little. First, Wikipedia and the other sources I read do not mention it at all. Second, I read about the problem of "choosing a good k for your data" - how is my algorithm supposed to converge if I keep assigning new centers to clusters that become empty?

If I simply ignore empty clusters, the algorithm converges after 30-40 iterations. Is it wrong to ignore them?

+8
k-means
4 answers

See this example of handling empty clusters: http://www.ceng.metu.edu.tr/~tcan/ceng465_f1314/Schedule/KMeansEmpty.html An empty cluster basically means that either 1) the random initialization was unlucky, or 2) the chosen number of clusters k is wrong. You should try several different values for k and pick the best one. If you encounter an empty cluster during an iteration, place a random data point into that cluster and continue. Hope this still helps with your homework, even a year later.
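
For concreteness, here is a rough sketch of how that rule can be wired into a basic k-means loop. It is only an illustration assuming NumPy and Euclidean distance; the function name kmeans and its defaults are my own, not part of the exercise sheet.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    """Plain Lloyd's k-means; empty clusters are re-seeded with a random
    data point, following the rule from the exercise sheet."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # initialize the centers with k distinct random data points
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()

    for _ in range(max_iter):
        # assignment step: nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # update step
        new_centers = np.empty_like(centers)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                # empty cluster: replace its center with a random data point
                new_centers[j] = X[rng.integers(len(X))]
            else:
                new_centers[j] = members.mean(axis=0)

        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    return centers, labels
```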

+4

Handling empty clusters is not part of the basic k-means algorithm, but it can lead to better cluster quality. As for convergence, it is never exact, only heuristically guaranteed, so in practice the convergence criterion is usually extended to include a maximum number of iterations.

As for the strategy for handling this problem, I would say that assigning a random data point is not very smart, since it can hurt cluster quality: the randomly chosen point may lie very close to, or very far from, the center it is currently assigned to. A better heuristic for this case is to take the point that is farthest from the center of the largest cluster, move it into the empty cluster, and repeat until there are no empty clusters left (a sketch follows below).
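
A possible sketch of that heuristic, assuming the data, the current centers, and the current labels are NumPy arrays; the helper name reseed_empty_clusters is my own, and it would be called between the assignment and the update step whenever a cluster comes up empty.

```python
import numpy as np

def reseed_empty_clusters(X, centers, labels):
    """For every empty cluster, take the point of the largest cluster that
    is farthest from that cluster's center and make it the new center."""
    k = len(centers)
    counts = np.bincount(labels, minlength=k)
    for j in np.where(counts == 0)[0]:
        largest = counts.argmax()                  # currently biggest cluster
        members = np.where(labels == largest)[0]
        # farthest member of the largest cluster from its own center
        d = np.linalg.norm(X[members] - centers[largest], axis=1)
        far_idx = members[d.argmax()]
        centers[j] = X[far_idx]                    # becomes the empty cluster's center
        labels[far_idx] = j                        # and is reassigned to it
        counts[largest] -= 1
        counts[j] += 1
    return centers, labels
```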

+2

You should not ignore empty clusters; replace them instead. k-means is an algorithm that can only find local minima, and an empty cluster is exactly the kind of local minimum you do not want. Your program will still converge even if you replace an empty cluster's center with a random data point. Remember that at the beginning of the algorithm you pick the initial k centers at random; if it can converge from k random points, why would it not converge from k-1 existing centers plus one new random point? It just needs a few more iterations.

+1

"Choosing a good k for your data" refers to the problem of picking the right number of clusters. Since k-means works with a fixed number of cluster centers, that number has to be chosen up front. Choosing it badly can make it hard to split the data points into clusters, or the clusters can end up small and meaningless.
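
One crude but common way to compare candidate values of k is to compute the within-cluster sum of squares for each and look for an "elbow" where adding more clusters stops helping. A small sketch, using scikit-learn's KMeans, which exposes that quantity as inertia_; the library choice and the toy data are my own, not implied by the question.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # toy data stand-in

# within-cluster sum of squares for a range of k; look for the "elbow"
# where adding another cluster stops reducing the error noticeably
for k in range(1, 11):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 2))
```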

I cannot tell you whether to ignore empty clusters. If you do, you may end up with fewer clusters than you specified at the beginning. That will confuse people who expect k-means to behave in a certain way, but it is not necessarily a bad idea.

If you re-seed empty cluster centers, your algorithm will probably still converge, provided this happens only a limited number of times. However, if you have to re-seed too often, it can happen that your algorithm never terminates.

+1
