Plotting output of kmeans (PyCluster impl)

Question

How do I plot the output of kmeans clustering in Python? I am using the PyCluster package. allUserVector is an n by m array, basically n users with m features.

    import Pycluster as pc
    import numpy as np

    clusterid, error, nfound = pc.kcluster(allUserVector, nclusters=3, transpose=0,
                                           npass=1, method='a', dist='e')

    clustermap, _, _ = pc.kcluster(allUserVector, nclusters=3, transpose=0,
                                   npass=1, method='a', dist='e')
    centroids, _ = pc.clustercentroids(allUserVector, clusterid=clustermap)
    print centroids
    print clusterid
    print nfound

I want to draw the clusters on a graph that clearly shows the clusters and which users are in each cluster. Each user is an m-dimensional vector. Any inputs?

python cluster-analysis k-means

Maxwell Mar 23 '12 at 22:01

1 answer

Dougal · Answer 1 · 2012-03-24T04:39:18+0000

It's hard to plot m-dimensional data directly. One way to do it is to map the data into 2D space through Principal Component Analysis (PCA). Once we've done that, we can plot the points with matplotlib (based on this answer).

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib import mlab
    import Pycluster as pc

    # make fake user data
    users = np.random.normal(0, 10, (20, 5))

    # cluster
    clusterid, error, nfound = pc.kcluster(users, nclusters=3, transpose=0,
                                           npass=10, method='a', dist='e')
    centroids, _ = pc.clustercentroids(users, clusterid=clusterid)

    # reduce dimensionality
    users_pca = mlab.PCA(users)
    cutoff = users_pca.fracs[1]
    users_2d = users_pca.project(users, minfrac=cutoff)
    centroids_2d = users_pca.project(centroids, minfrac=cutoff)

    # make a plot
    colors = ['red', 'green', 'blue']
    plt.figure()
    plt.xlim([users_2d[:,0].min() - .5, users_2d[:,0].max() + .5])
    plt.ylim([users_2d[:,1].min() - .5, users_2d[:,1].max() + .5])
    plt.xticks([], []); plt.yticks([], [])  # numbers aren't meaningful

    # show the centroids
    plt.scatter(centroids_2d[:,0], centroids_2d[:,1], marker='o', c=colors, s=100)

    # show user numbers, colored by their cluster id
    for i, ((x,y), kls) in enumerate(zip(users_2d, clusterid)):
        plt.annotate(str(i), xy=(x,y), xytext=(0,0), textcoords='offset points',
                     color=colors[kls])

If you want to plot something other than numbers, just change the first argument to annotate. For example, you could use usernames or something like that.
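As a minimal sketch of that change, here is the annotation loop with names instead of indices. The names, coordinates, and cluster ids below are made-up placeholders (not from the question), standing in for users_2d and clusterid; the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for the real data: already-projected 2D points,
# one cluster id per user, and one label per user.
names = ["alice", "bob", "carol", "dave"]
points_2d = np.array([[0.0, 1.0], [1.5, -0.5], [-1.0, 0.3], [0.8, 0.9]])
cluster_ids = [0, 1, 2, 0]
colors = ["red", "green", "blue"]

fig, ax = plt.subplots()
ax.set_xlim(points_2d[:, 0].min() - 0.5, points_2d[:, 0].max() + 0.5)
ax.set_ylim(points_2d[:, 1].min() - 0.5, points_2d[:, 1].max() + 0.5)

# Same annotate call as in the answer, but the first argument is the
# user's name rather than str(i).
labels = []
for name, (x, y), kls in zip(names, points_2d, cluster_ids):
    labels.append(ax.annotate(name, xy=(x, y), xytext=(0, 0),
                              textcoords="offset points", color=colors[kls]))

fig.savefig("clusters_named.png")  # hypothetical output file name
```

Anything that can be turned into a string works as the label, as long as it is in the same order as the rows passed to the clustering call.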

Note that the clusters may look a bit "wrong" in this space (for example, 15 seems closer to red than green below), because this is not the actual space in which the clustering happened. In this case, the first two principal components retain 61% of the variance:

    >>> np.cumsum(users_pca.fracs)
    array([ 0.36920636,  0.61313708,  0.81661401,  0.95360623,  1.        ])
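Newer matplotlib releases no longer include mlab.PCA, so here is a rough NumPy-only sketch of the same idea: project to 2D and see how much variance each component keeps. The helper names pca_fit and pca_project are made up for this example. It standardizes each column before the SVD, which I believe matches what mlab.PCA did by default, but treat that as an assumption. The numbers will not match the output above, since the data is random, and the signs of the projected axes may be flipped.

```python
import numpy as np

def pca_fit(data):
    """Fit PCA on row-wise observations (hypothetical helper).

    Returns the column means, column standard deviations, the principal
    axes (one per row), and the fraction of variance along each axis.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    scaled = (data - mean) / std
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(scaled, full_matrices=False)
    fracs = s**2 / np.sum(s**2)
    return mean, std, vt, fracs

def pca_project(points, mean, std, components, n_components=2):
    """Project points onto the first n_components principal axes."""
    scaled = (np.asarray(points, dtype=float) - mean) / std
    return scaled @ components[:n_components].T

# Fake user data with the same shape as in the answer: 20 users, 5 features.
rng = np.random.default_rng(0)
users = rng.normal(0, 10, (20, 5))

mean, std, components, fracs = pca_fit(users)
users_2d = pca_project(users, mean, std, components)

# Cumulative fraction of variance kept by the first 1, 2, ... components.
cumulative = np.cumsum(fracs)
print(cumulative)
```

Centroids can be pushed through pca_project with the same mean, std, and components, so they land in the same 2D space as the users. The second entry of cumulative is the fraction of variance the 2D plot retains.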
