Clustering 3D structural data

Let's say I have a number of objects (similar to proteins, but not exactly), each of which is represented by a set of n 3D coordinates, i.e. an (n, 3) array. Each of these objects is oriented arbitrarily in space. Their similarity can be calculated by aligning them with the Kabsch algorithm and computing the root-mean-square deviation (RMSD) of the aligned coordinates.
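
For concreteness, the similarity measure I have in mind looks roughly like this (a minimal numpy sketch of the textbook SVD formulation of Kabsch, not my actual code):

    import numpy as np

    def kabsch_rmsd(P, Q):
        '''Align P onto Q with the Kabsch algorithm and return the RMSD.
        P and Q are (n, 3) arrays of corresponding coordinates.'''
        # center both coordinate sets on their centroids
        P = P - P.mean(axis=0)
        Q = Q - Q.mean(axis=0)
        # optimal rotation from the SVD of the covariance matrix
        V, S, Wt = np.linalg.svd(P.T.dot(Q))
        # correct for a possible reflection
        d = np.sign(np.linalg.det(V.dot(Wt)))
        U = V.dot(np.diag([1, 1, d])).dot(Wt)
        diff = P.dot(U) - Q
        return np.sqrt((diff * diff).sum() / len(P))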

My question is: what would be the recommended way to cluster a large set of these structures so as to extract the most populated cluster (i.e. the one to which the most structures belong)? And is there a way to do this in Python? As an example, here is a simplified set of unclustered structures (each represented by the coordinates of four vertices):

[figure: a set of unclustered four-vertex structures]

And then the desired clustering (using two clusters):

[figure: the same structures grouped into two clusters]

I tried aligning all of the structures to a reference structure (i.e. the first one) and then running k-means on the distances between the reference and the aligned coordinates using Pycluster.kcluster, but this feels clumsy and doesn't work very well: the structures within each cluster end up not being very similar to each other. Ideally the clustering would be performed on the actual structures themselves rather than on difference vectors, but the structures have dimensions (n, 3) rather than the flat (n,) vectors that k-means requires.
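
For reference, that attempt looked roughly like this (a sketch; structures stands for a hypothetical (m, n, 3) array of my objects, and kabsch_align is a variant of the Kabsch routine above that returns the rotated coordinates):

    import numpy as np
    import Pycluster as pc

    def kabsch_align(P, Q):
        '''Rotate centered P (n, 3) onto centered Q (n, 3) and return it.'''
        P = P - P.mean(axis=0)
        Q = Q - Q.mean(axis=0)
        V, S, Wt = np.linalg.svd(P.T.dot(Q))
        d = np.sign(np.linalg.det(V.dot(Wt)))
        return P.dot(V.dot(np.diag([1, 1, d])).dot(Wt))

    # align everything to the first structure, then use the per-vertex
    # distances to the reference as the k-means feature matrix
    aligned = np.array([kabsch_align(s, structures[0]) for s in structures])
    dists = np.sqrt(((aligned - aligned[0]) ** 2).sum(axis=2))  # (m, n)
    labels, error, nfound = pc.kcluster(dists, nclusters=2)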

Another option I've tried is scipy.cluster.hierarchy. It seems to work very well, but I have trouble deciding which cluster is the most populated, since you can always find a more populated cluster by moving up to the next branch of the tree.
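
For what it's worth, the tree can be cut into a fixed number of flat clusters and the biggest one read off, although the number of clusters is then an arbitrary choice (a sketch, assuming the kabsch_rmsd helper above and the same hypothetical structures array):

    import numpy as np
    import itertools as it
    from scipy.cluster.hierarchy import linkage, fcluster

    # condensed vector of pairwise RMSDs between all structures
    rmsds = [kabsch_rmsd(a, b) for a, b in it.combinations(structures, r=2)]

    # build the tree, then cut it into a fixed number of flat clusters
    Z = linkage(rmsds, method='average')
    labels = fcluster(Z, t=2, criterion='maxclust')

    # the most populated cluster (fcluster labels are 1-based)
    print np.bincount(labels).argmax()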

Any thoughts, suggestions, or ideas about different clustering algorithms (already implemented in Python) would be greatly appreciated.

1 answer

To give a preliminary answer to my own question, I suggest using the list of distances between every pair of points in a shape as the metric for the clustering.

Let me create some shapes:

    import numpy as np
    from pylab import plot, text

    shapes = np.array([[[1, 4], [4, 2], [11, 2], [14, 0]],
                       [[4, 5], [4, 2], [11, 2], [13, 0]],
                       [[1, 3], [4, 2], [11, 2], [14, 1.5]],
                       [[3, 5], [4, 2], [10, 7], [7, 9]],
                       [[5, 5], [4, 2], [10, 7], [6, 6]]])

    def random_rotation():
        '''Return a random 2D rotation matrix.'''
        theta = 3 * np.pi * np.random.random()
        rotMatrix = np.array([[np.cos(theta), -np.sin(theta)],
                              [np.sin(theta),  np.cos(theta)]])
        return rotMatrix

    # randomly rotate each shape and pad the points to 3D with z = 0
    new_shapes = []
    for s in shapes:
        rr = random_rotation()
        new_shapes += [[list(rr.dot(p)) + [0] for p in s]]
    new_shapes = np.array(new_shapes)

    for i, s in enumerate(new_shapes):
        plot(s[:, 0], s[:, 1], color='black')
        text(np.mean(s[:, 0]), np.mean(s[:, 1]), str(i), fontsize=14)

[figure: the five randomly rotated shapes, labeled 0 to 4]

Then we create some helper functions and build an array containing all the inter-vertex distances for each shape (darray).

    import itertools as it
    import math

    def vec_distance(v1, v2):
        '''The Euclidean distance between two vectors.'''
        diff = v2 - v1
        return math.sqrt(sum(diff * diff))

    def distances(s):
        '''Compute the array of inter-vertex distances for a shape s.'''
        ds = [vec_distance(p1, p2) for p1, p2 in it.combinations(s, r=2)]
        return np.array(ds)

    # create an array of inter-vertex distances for each shape
    darray = np.array([distances(s) for s in new_shapes])

Cluster them into two clusters using Pycluster:

    import Pycluster as pc

    clust = pc.kcluster(darray, 2)
    print clust

And we see that we end up with three shapes in the first cluster and two in the other:

 (array([0, 0, 0, 1, 1], dtype=int32), 4.576996142441375, 1) 
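
A small addition: since the original goal was to extract the most populated cluster, it can be read straight off the label array that kcluster returns:

    labels = clust[0]
    biggest = np.bincount(labels).argmax()         # most frequent cluster label
    print biggest, np.where(labels == biggest)[0]  # here: 0 and shapes 0, 1, 2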

But what do the clustered shapes actually look like?

    import brewer2mpl

    # color each shape by its cluster label
    dark2 = brewer2mpl.get_map('Dark2', 'qualitative', 4).mpl_colors
    for i, (s, c) in enumerate(zip(new_shapes, clust[0])):
        plot(s[:, 0], s[:, 1], color=dark2[c])
        text(np.mean(s[:, 0]), np.mean(s[:, 1]), str(i), fontsize=14)

[figure: the shapes colored by cluster]

Looks nice! The problem is that the distance array grows quadratically with the number of vertices in each shape. I found a presentation that describes this problem and suggests some solutions (such as SVD, for what I believe is a form of dimensionality reduction) to speed things up.
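
I haven't tried it, but a plain PCA-via-SVD pass over darray would look something like this (a sketch; the number of retained components k is an arbitrary choice):

    # project the distance features onto their first k principal components
    k = 3  # assumed value; tune for the real data
    X = darray - darray.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    reduced = X.dot(Vt[:k].T)  # (num_shapes, k) matrix to cluster instead of darray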

I'm not going to accept my own answer yet, because I'm interested in any other ideas or thoughts on how to approach this simple problem.
