Using the silhouette for clustering in Spark

I want to use the silhouette to determine the optimal value of k when using KMeans clustering in Spark. Is there a good way to parallelize this, i.e. make it scalable?

+7
4 answers

No, the silhouette by definition does not scale.

It uses pairwise distances, so it always takes O(n^2) time to compute.

You will need to use something else. Using the silhouette on big data is absurd: it takes far longer to compute the evaluation measure than to run the actual k-means clustering algorithm.

Or reconsider what you are doing. Does it even make sense to use the silhouette here at all? You could also decide to run something faster than Spark on single nodes, compute the silhouette there, and simply parallelize over k without all the overhead of distributed computation. Spark may beat MapReduce/Mahout, but it will lose to a good non-distributed implementation.
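Not part of the original answer, but here is a minimal sketch of that idea under some assumptions: the data (or a sample of it) fits in memory on one node, and scikit-learn plus joblib are available to parallelize over candidate values of k.

    # Sketch: single-node model selection over k, parallelized across k values.
    # Assumes X (a NumPy feature matrix) fits in memory on one node.
    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def score_k(X, k, sample_size=10_000, seed=42):
        """Fit k-means for one k and return its (sampled) silhouette score."""
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        # silhouette_score is O(n^2); evaluate on a sample to keep it tractable
        return k, silhouette_score(X, labels, sample_size=sample_size, random_state=seed)

    if __name__ == "__main__":
        X = np.random.rand(50_000, 8)          # placeholder for the real data
        scores = Parallel(n_jobs=-1)(delayed(score_k)(X, k) for k in range(2, 11))
        best_k, best_score = max(scores, key=lambda kv: kv[1])
        print(f"best k = {best_k}, silhouette = {best_score:.3f}")

The point is only to illustrate "parallelize over k on a single node"; the sampling of the silhouette is an extra shortcut, not something the answer prescribes.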

+3

Yes, there are ways to make the silhouette metric scale to large data. No, to my knowledge it is not fully published, so I will describe it here. It is not that difficult, so you can understand it and maybe write it yourself. If you write it first, please let me know so that I can use it.

It seems we both need to write a high-performance scorer. It should take any cluster-assignment column vector as input, so the evaluation function can work with any clustering implementation. If possible, use MapReduce for a simple distributed version, as well as a shared-memory one. It looks feasible. Page 4 shows the math: http://cran.us.r-project.org/web/packages/clValid/vignettes/clValid.pdf

LSH would help algorithmically, since it avoids the exact distance calculations that dominate the math. A good LSH implementation would then be necessary, but I have not found one. sklearn's LSHForest is the right idea, but it is not implemented well enough. A simplified or approximate silhouette would also be interesting: using LSH gives approximate results anyway, so use LSH to find only the nearest point and centroid, which avoids computing all pairs.

There are some useful hints on page 28 of this paper: https://arxiv.org/pdf/1605.01802.pdf They seem to say: use a simplified silhouette rather than the plain silhouette, as follows. Change the calculation from point-to-point distances to point-to-centroid distances. That cuts the work from all pairs of points in a cluster and its nearest neighboring cluster, which is O(n^2), to a linear O(n) computation. Here is my understanding and translation:

Start with:
    File of cluster tuples: (clusterID, cluster centroid point)
    File of example point tuples: (example point, clusterID)
    Note: clusterID is the cluster-assignment column vector described above.

Processing:
    For each cluster tuple, run map():
        hashmap[clusterID] = cluster centroid point
    For each example point tuple, run:
        map(): dist = distance between the point and its cluster centroid point;
               emit (dist, example point (copy), clusterID (copy))
        map(): find the closest cluster centroid point to the example point (minClusterDistance)
        map(): emit SSI = (dist - minClusterDistance) / minClusterDistance
    reduce(): by clusterID, emit (clusterID, sum of SSI in the cluster / number of points)

I may end up being the one to develop it. It is crazy that no one has written a fast scorer like this before. My expectation is that people have already done it, but they keep it to themselves for competitive reasons (corporate profit, Kaggle placement, etc.).

The above is formatted as code, but it is not code. It is an English outline, or pseudocode. stackoverflow made me format that section as code in order to accept it.
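For reference, here is a minimal PySpark sketch of that outline; it is my own reading, not a published implementation. It assumes a DataFrame with a Spark ML vector column named features and a cluster-ID column named prediction (both names are assumptions, e.g. the output of pyspark.ml KMeans), with the centroids passed in separately. It uses the usual simplified-silhouette form s = (b - a) / max(a, b), where a is the distance to the point's own centroid and b is the distance to the nearest other centroid, and returns the mean over all points rather than the per-cluster averages of the outline above.

    # Sketch of a simplified (centroid-based) silhouette in PySpark.
    # Assumptions: `df` has a "features" Spark ML vector column and a
    # "prediction" cluster-ID column; `centers` is a list of centroids,
    # e.g. KMeansModel.clusterCenters().
    import numpy as np

    def simplified_silhouette(df, centers, features_col="features", pred_col="prediction"):
        bc_centers = df.rdd.context.broadcast([np.asarray(c) for c in centers])

        def point_score(row):
            x = np.asarray(row[features_col].toArray())
            k = int(row[pred_col])
            dists = [float(np.linalg.norm(x - c)) for c in bc_centers.value]
            a = dists[k]                                        # own centroid
            b = min(d for i, d in enumerate(dists) if i != k)   # nearest other centroid
            return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

        # One pass over the data: O(n * k) distance computations instead of O(n^2).
        return df.rdd.map(point_score).mean()

    # Usage (names assumed):
    #   model = pyspark.ml.clustering.KMeans(k=5).fit(df)
    #   score = simplified_silhouette(model.transform(df), model.clusterCenters())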

+5

ClusteringEvaluator, which computes the silhouette score, has been available since Spark 2.3.0.
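Not from the original answer, but a minimal PySpark usage sketch; the DataFrame df with a "features" vector column, the column names, and k=5 are all assumed for illustration.

    from pyspark.ml.clustering import KMeans
    from pyspark.ml.evaluation import ClusteringEvaluator

    # `df` is assumed to be a DataFrame with a "features" vector column.
    model = KMeans(k=5, seed=1, featuresCol="features").fit(df)
    predictions = model.transform(df)

    # Computes the silhouette (squared Euclidean by default) in a distributed way.
    evaluator = ClusteringEvaluator(featuresCol="features",
                                    predictionCol="prediction",
                                    metricName="silhouette")
    print(evaluator.evaluate(predictions))

To pick k, loop over candidate values and keep the one with the highest silhouette.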

0

I cannot add a comment, but I want to highlight the answer from Yulin Guo:

ClusteringEvaluator, which computes the silhouette score, has been available since Spark 2.3.0.

This is a scalable, efficient implementation introduced in SPARK-14516.

0
