Regarding the quality of various k-means algorithms

I see that for k-means we have Lloyd's algorithm, Elkan algorithm, and we also have a hierarchical version of k-means.

For all of these algorithms, I see that the Elkan algorithm can provide acceleration in terms of speed. But what I want to know is the quality from all of these k-means algorithms. Each time we run these algorithms, the result will be different, due to their heuristic and probabilistic nature. Now my question is when it comes to a clustering algorithm, such as a k-tool, if we want to get a better result (like with less distortion, etc.) Between all these k-means algorithms that the algorithm could give you the best quality? Is it possible to measure such a thing?

+4
source share
4 answers

The best solution is usually one that has a better (lower) value of J(x,c) , where:

 J(x,c) = 1/|x| * Sum(distance(x(i),c(centroid(i)))) for each i in [1,|x|] 

Wherre:

  • x is a list of samples
  • |x| - size x (number of elements)
  • [1,|x|] all numbers from 1 to |x| (inclusive)
  • c is the list of centroids (or means) of the clusters (ie, for k clusters | c | = k)
  • distance(a,b) (sometimes denoted as || ab || - distance between the "point" a to the "point" b (in Euclidean 2D space it is sqrt((ax-bx)^2 + (ay-by)^2) ))
  • centroid (i) - the centroid / average that is closest to x(i)

Please note that this approach does not require switching to a controlled technique and can be fully automated!

+4
source

As I understand it, you need some tagged data to crosscheck the clustering algorithm.

+1
source

What about the pathological case of a dataset with two moons? uncontrolled k-funds will fail. The high quality method that I know of uses a more probabilistic approach using mutual information and combinatorial optimization. Basically, you ask the clustering problem as the problem of finding the optimal subset [cluster] of a complete set of points for the case of two clusters.

Here you can find the relevant article (page 42) and the corresponding Matlab Code here to play with (checkout the two-moons case). If you are interested in a high-performance C ++ implementation with acceleration> 30x, you can find HPSFO here .

+1
source

To compare quality, you must have a labeled dataset and evaluate the results against some criteria, such as NMI

0
source

All Articles