Yes, there are ways to compute a silhouette metric at large scale. No, a complete implementation is not published anywhere I know of, as I will describe here. It's not so difficult, so you could understand it and maybe write it yourself. If you do write it first, please let me know so I can use it.
It seems we both need to write a high-performance scorer. It should accept any cluster-assignment column vector, so that the evaluation function can work with any clustering implementation. If possible, write it with MapReduce for a simple distributed version, as well as a shared-memory version; that looks feasible. Page 4 here shows the math: http://cran.us.r-project.org/web/packages/clValid/vignettes/clValid.pdf

LSH would help algorithmically, since it avoids the exact distance computations that dominate that math. A good LSH implementation would then be essential, but I have not found one. Sklearn's LSH Forest is the right idea, but it is not implemented well enough. Enabling LSH makes the results approximate: use it to find only the nearest point or centroid, which avoids computing all pairs.

A simplified or approximate silhouette would also be interesting. There are some useful tips on page 28 of this paper: https://arxiv.org/pdf/1605.01802.pdf As I read it, they say: use a simplified silhouette, not the full silhouette, as follows. Change the calculation from point-to-point distances to point-to-centroid distances. That cuts the work from all pairs of points in a cluster and its nearest neighboring cluster, which is O(n^2), down to a linear O(N) calculation. Here is my understanding and translation, with two Python sketches after it:
Start with:

    File of cluster tuples: (clusterID, cluster centroid point)
    File of example point tuples: (example point, clusterID)

Notice the clusterID is the cluster-assignment column vector described above.

Processing:

    For each cluster tuple, run map():
        hashmap[clusterID] = cluster centroid point
    For each example point tuple, run:
        map(): a = distance between the example point and its own cluster centroid
        map(): b = distance between the example point and the closest *other* cluster centroid
        map(): emit (clusterID, SSI) where SSI = (b - a) / max(a, b)
    reduce(): by clusterID, emit (clusterID, cluster's sum of SSI / cluster's point count)
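To sanity-check that recipe, here is a minimal single-process Python translation of it. The function name map_reduce_simplified_silhouette and the tuple-file formats are my own placeholders, not any library's API; in a real MapReduce job the centroid hashmap would be broadcast to the mappers and the reduce would be keyed on clusterID.

    import numpy as np
    from collections import defaultdict

    def map_reduce_simplified_silhouette(cluster_tuples, point_tuples):
        # Setup: hashmap[clusterID] = cluster centroid point
        centroids = {cid: np.asarray(c, dtype=float) for cid, c in cluster_tuples}

        # Map: one linear pass over the example points, emitting (clusterID, SSI)
        emitted = []
        for point, cid in point_tuples:
            p = np.asarray(point, dtype=float)
            a = np.linalg.norm(p - centroids[cid])   # distance to own centroid
            b = min(np.linalg.norm(p - c)            # distance to nearest *other* centroid
                    for other, c in centroids.items() if other != cid)
            emitted.append((cid, (b - a) / max(a, b)))

        # Reduce: by clusterID, emit (clusterID, mean SSI of that cluster)
        sums, counts = defaultdict(float), defaultdict(int)
        for cid, ssi in emitted:
            sums[cid] += ssi
            counts[cid] += 1
        return {cid: sums[cid] / counts[cid] for cid in counts}

The min() over the other centroids is the exact nearest-centroid search; with many clusters, that inner loop is exactly where an approximate LSH lookup would be dropped in.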
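For the shared-memory case, the same simplified silhouette vectorizes naturally. Again, simplified_silhouette and its argument names are mine, assuming labels are integers 0..k-1 that index rows of the centroids array:

    import numpy as np

    def simplified_silhouette(X, labels, centroids):
        """Per-point simplified silhouette: point-to-centroid distances only,
        O(N*k) work instead of the O(n^2) all-pairs full silhouette."""
        X = np.asarray(X, dtype=float)
        centroids = np.asarray(centroids, dtype=float)
        labels = np.asarray(labels)

        # (N, k) matrix of distances from every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

        n = len(X)
        a = d[np.arange(n), labels]         # distance to own centroid
        d[np.arange(n), labels] = np.inf    # mask own cluster, then take the
        b = d.min(axis=1)                   # nearest other centroid
        return (b - a) / np.maximum(a, b)

simplified_silhouette(X, labels, centroids).mean() gives the overall score, and grouping the per-point values by label reproduces the per-cluster averages from the reduce step above.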
I may end up being the developer. It's crazy that no one has written a fast one like this before. My expectation is that people have already done it, but they keep it to themselves for competitive reasons (corporate profit, Kaggle placement, etc.).
The recipe above is formatted as code, but it is not code; it is an English outline, i.e. pseudocode. Stack Overflow made me format that section as code in order to accept the post.