moooeeeep's answer recommends hierarchical clustering. I wanted to elaborate on how to choose the clustering threshold, i.e. how many clusters you end up with.
One way is to compute the clustering for several threshold values t1, t2, t3, ... and then compute a metric for the "quality" of each resulting clustering. The premise is that the clustering with the optimal number of clusters will have the maximum value of the quality metric.
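A minimal sketch of that threshold sweep, assuming SciPy for the hierarchical clustering and scikit-learn's `calinski_harabasz_score` as the quality metric (any other metric could be swapped in). The helper name `best_threshold`, the Ward linkage, the synthetic data and the threshold grid are all my own choices for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import calinski_harabasz_score


def best_threshold(points, thresholds, method="ward"):
    """Return (threshold, score, labels) for the best-scoring cut, or None."""
    tree = linkage(points, method=method)  # build the dendrogram once
    best = None
    for t in thresholds:
        # Cut the dendrogram at distance t to get flat cluster labels.
        labels = fcluster(tree, t=t, criterion="distance")
        n_clusters = len(np.unique(labels))
        # The score is undefined for a single cluster or one cluster per point.
        if n_clusters < 2 or n_clusters >= len(points):
            continue
        score = calinski_harabasz_score(points, labels)
        if best is None or score > best[1]:
            best = (t, score, labels)
    return best


# Synthetic data (illustration only): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

threshold, score, labels = best_threshold(points, np.linspace(2.0, 40.0, 20))
print(threshold, score, len(np.unique(labels)))
```

The dendrogram is built once and only re-cut for each threshold, so trying many thresholds is cheap.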
An example of a quality metric that I have used in the past is Calinski-Harabasz. In short: you compute the between-cluster dispersion (how far the cluster centroids are from the overall centroid) and divide it by the within-cluster dispersion (how far the points are from their own cluster centroid), each normalized by its degrees of freedom. The optimal cluster assignment is the one whose clusters are most separated from each other and are the "densest".
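To make the ratio concrete, here is a small NumPy sketch of the Calinski-Harabasz index as it is usually defined: (between-cluster sum of squares / (k - 1)) divided by (within-cluster sum of squares / (n - k)), for n points in k clusters. The function name and the toy data are mine; in practice a library implementation is probably the better choice.

```python
import numpy as np


def calinski_harabasz(points, labels):
    """Calinski-Harabasz index for a flat clustering (higher is better)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    n = len(points)
    clusters = np.unique(labels)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need between 2 and n - 1 clusters")

    overall_mean = points.mean(axis=0)
    between = 0.0  # size-weighted squared distance of centroids to overall mean
    within = 0.0   # squared distance of points to their own centroid
    for c in clusters:
        members = points[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)

    return (between / (k - 1)) / (within / (n - k))


# Toy data: two tight pairs of points, far apart from each other.
toy = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = calinski_harabasz(toy, [0, 0, 1, 1])  # pairs grouped correctly
bad = calinski_harabasz(toy, [0, 1, 0, 1])   # pairs split across clusters
print(good, bad)
```

The grouping that keeps each tight pair together scores far higher than the one that splits them, which is the behaviour the metric is meant to reward.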
By the way, you do not have to use hierarchical clustering. You can also use something like k-means, recompute it for each k, and then pick the k that has the highest Calinski-Harabasz score.
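The k-means variant could look roughly like this, assuming scikit-learn's `KMeans` and `calinski_harabasz_score`. The helper name `best_k`, the range of k values and the synthetic data are assumptions for the sake of the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score


def best_k(points, k_values, random_state=0):
    """Fit k-means for each k and return (best k, {k: score})."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(points)
        scores[k] = calinski_harabasz_score(points, labels)
    # Pick the k whose clustering has the highest score.
    return max(scores, key=scores.get), scores


# Synthetic data (illustration only): four well-separated 2-D blobs.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

k, scores = best_k(points, range(2, 9))
print(k, scores[k])
```

The score needs at least two clusters, so the sweep starts at k = 2.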
Let me know if you need more links and I will comb my hard drive for some documents.
Max Apr 13 2018-12-12T00:00Z