How do you visualize a ward tree from sklearn.cluster.ward_tree?

Sklearn implements one agglomerative clustering algorithm: Ward's method, which minimizes variance. Sklearn is usually well documented with many good use cases, but I could not find any examples of how to use this function.

Basically my problem is to draw the dendrogram for the clustering of my data, but I do not understand the function's output. The documentation says that it returns the children, the number of components, the number of leaves, and the parents of each node.

However, for my sample data, the results do not make sense to me. For a (32, 542) matrix, clustered with a connectivity matrix, this is the output:

    >>> wt = ward_tree(mymat, connectivity=connectivity, n_clusters=2)
    >>> mymat.shape
    (32, 542)
    >>> wt
    (array([[16,  0],
           [17,  1],
           [18,  2],
           [19,  3],
           [20,  4],
           [21,  5],
           [22,  6],
           [23,  7],
           [24,  8],
           [25,  9],
           [26, 10],
           [27, 11],
           [28, 12],
           [29, 13],
           [30, 14],
           [31, 15],
           [34, 33],
           [47, 46],
           [41, 40],
           [36, 35],
           [45, 44],
           [48, 32],
           [50, 42],
           [38, 37],
           [52, 43],
           [54, 39],
           [53, 51],
           [58, 55],
           [56, 49],
           [60, 57]]),
     1,
     32,
     array([32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 32,
            33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 53, 48,
            48, 51, 51, 55, 55, 57, 50, 50, 54, 56, 52, 52, 49, 49, 53, 60, 54,
            58, 56, 58, 57, 59, 60, 61, 59, 59, 61, 61]))

In this case, I asked for two clusters on 32 feature vectors. But how are the two clusters visible in this output? Where are they? And what do the children really mean? How can the child ids be larger than the total number of samples?

python scikit-learn machine-learning hierarchical-clustering
1 answer

Regarding the first element of the output, the documentation says:

The children of each non-leaf node. Values less than n_samples refer to leaves of the tree. A greater value i indicates a node with children children[i - n_samples].
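To make that rule concrete, here is a quick check against the output in the question (a small sketch; it assumes `wt` is the tuple printed above, with 32 samples):

    # Sketch, assuming `wt` is the tuple from the question (n_samples = 32).
    children = wt[0]          # first element of the output: the merges
    n_samples = 32

    # Ids 0..31 are the original samples (leaves). Id 60 is larger than
    # n_samples, so it is an internal node whose children sit in row
    # 60 - n_samples = 28 of the children array:
    print(children[60 - n_samples])   # -> [56 49]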

I had some trouble understanding what this means, but the following code helped. We generate normally distributed data with two "clusters": one with three data points with mean 0, and one with two data points with mean 100. We therefore expect the first three data points to end up in one branch of the output tree and the other two in the other.

    from sklearn.cluster import ward_tree
    import numpy as np
    import itertools

    # Two well-separated "clusters": three points near 0, two points near 100.
    X = np.concatenate([np.random.randn(3, 10),
                        np.random.randn(2, 10) + 100])
    w = ward_tree(X)

    # w[0] holds the merges (children); w[2] is n_leaves, so the ids of the
    # internal nodes start counting from there.
    ii = itertools.count(w[2])
    [{'node_id': next(ii), 'left': x[0], 'right': x[1]} for x in w[0]]

This prints the tree as:

    [{'node_id': 5, 'right': 2, 'left': 1},
     {'node_id': 6, 'right': 4, 'left': 3},
     {'node_id': 7, 'right': 5, 'left': 0},
     {'node_id': 8, 'right': 7, 'left': 6}]

where the numbers are node ids. If node_id < 5 (the number of samples), it is the index of a data point (a leaf node). If node_id >= 5, it is an internal node. We see that the data clusters as expected:

          8
         / \
        7   \
       / \   \
      5   \   6
     / \   \ / \
    1   2  0 3  4
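As for where the two clusters are: you can read the membership of each top-level branch straight off the children array. Here is a small sketch (the `leaves` helper is mine, not part of sklearn; it assumes `w` and the merge order shown above):

    # Sketch: list the leaf indices under each child of the root node.
    def leaves(node_id, children, n_leaves):
        # A leaf is its own singleton subtree.
        if node_id < n_leaves:
            return [node_id]
        left, right = children[node_id - n_leaves]
        return (leaves(left, children, n_leaves)
                + leaves(right, children, n_leaves))

    children, n_leaves = w[0], w[2]
    root = n_leaves + len(children) - 1   # the last merge is the root, node 8
    for branch in children[root - n_leaves]:
        print(branch, leaves(branch, children, n_leaves))
    # -> 6 [3, 4] and 7 [0, 1, 2]: the two expected clusters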
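Finally, to actually draw the dendrogram asked about in the title, one approach (a sketch, not the only way) is to convert the ward_tree output into a SciPy linkage matrix and pass it to scipy.cluster.hierarchy.dendrogram. This assumes your sklearn version supports return_distance=True, which additionally returns the merge distances:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram
    from sklearn.cluster import ward_tree

    X = np.concatenate([np.random.randn(3, 10),
                        np.random.randn(2, 10) + 100])
    children, n_components, n_leaves, parents, distances = ward_tree(
        X, return_distance=True)

    # SciPy expects rows of [left, right, distance, n_samples_under_node],
    # so count the samples under each internal node.
    counts = np.zeros(len(children))
    for i, (left, right) in enumerate(children):
        counts[i] = sum(1 if c < n_leaves else counts[c - n_leaves]
                        for c in (left, right))
    linkage = np.column_stack([children, distances, counts]).astype(float)

    dendrogram(linkage)
    plt.show()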
