Unable to get scipy hierarchical clustering to work

I wrote a simple script that runs hierarchical clustering on a simple test dataset. The test data consists of two groups of 100 points each, as generated in the code below.

The fclusterdata function looked like a good candidate for splitting my data into two clusters. It takes two required arguments: the dataset and a threshold. The problem is that I could not find a threshold value that produces the expected two clusters.

I would be glad if someone could tell me what I'm doing wrong. I would also be happy if someone could point out other approaches better suited to my problem (I explicitly want to avoid specifying the number of clusters in advance).

Here is my code:

    import time
    import scipy.cluster.hierarchy as hcluster
    import numpy.random as random
    import numpy
    import pylab

    pylab.ion()

    # 200 two-dimensional points, stored as a 2 x 200 array
    data = random.randn(2,200)
    # shift the first 100 points by 10 along both axes
    data[:100,:100] += 10

    # try thresholds from 0.5 to 1.4 and plot the resulting clusters
    for i in range(5,15):
        thresh = i/10.
        clusters = hcluster.fclusterdata(numpy.transpose(data), thresh)
        pylab.scatter(*data[:,:], c=clusters)
        pylab.axis("equal")
        title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
        print(title)
        pylab.title(title)
        pylab.draw()
        time.sleep(0.5)
        pylab.clf()

Here is the result:

    threshold: 0.500000, number of clusters: 129
    threshold: 0.600000, number of clusters: 129
    threshold: 0.700000, number of clusters: 129
    threshold: 0.800000, number of clusters: 75
    threshold: 0.900000, number of clusters: 75
    threshold: 1.000000, number of clusters: 73
    threshold: 1.100000, number of clusters: 58
    threshold: 1.200000, number of clusters: 1
    threshold: 1.300000, number of clusters: 1
    threshold: 1.400000, number of clusters: 1
1 answer

Note that the help text for fclusterdata contains an error. The correct definition of the parameter t is: "Cutoff threshold for the cluster function or maximum number of clusters (criterion = maxclust)."

So try the following:

    clusters = hcluster.fclusterdata(numpy.transpose(data), 2,
                                     criterion='maxclust', metric='euclidean',
                                     depth=1, method='centroid')
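Since the question asks for a way to avoid fixing the number of clusters, here is a minimal sketch comparing the maxclust call above with a distance cutoff. As far as I can tell, fclusterdata defaults to criterion='inconsistent', so the thresholds 0.5 to 1.4 in the question were probably read as inconsistency coefficients rather than distances. The seed, the variable names, and the cutoff of 7.0 (roughly half the distance between the two group centres) are my own assumptions for this test data, not values from the question.

```python
import numpy as np
import scipy.cluster.hierarchy as hcluster

# Same kind of test data as in the question: two groups of 100 points,
# the first shifted by 10 along both axes. The seed is an arbitrary choice.
rng = np.random.default_rng(0)
points = rng.standard_normal((200, 2))
points[:100] += 10

# Option 1: t is the maximum number of clusters.
labels_maxclust = hcluster.fclusterdata(
    points, 2, criterion='maxclust', metric='euclidean', method='centroid')

# Option 2: t is a distance cutoff, so no cluster count is given up front.
# The value 7.0 is an assumption that suits this particular data set.
labels_distance = hcluster.fclusterdata(
    points, 7.0, criterion='distance', metric='euclidean', method='centroid')

print("maxclust: %d clusters" % len(set(labels_maxclust)))
print("distance: %d clusters" % len(set(labels_distance)))
```

With groups this well separated, both calls should find the same two groups. On real data the distance cutoff has to be chosen from the scale of the data, for example by looking at the merge distances in the dendrogram.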