Clustering a billion items (or: which clustering methods run in linear time?)

I have a billion feature vectors that I would like to group into approximate clusters. Looking at the methods at http://scikit-learn.org/stable/modules/clustering.html#clustering , for example, it is not entirely clear to me how their runtime scales with the size of the data (except for Affinity Propagation, which is clearly too slow).

What methods are suitable for clustering such a large dataset? I assume any suitable method would have to run in O(n) time.

+7
python machine-learning
2 answers

The complexity of k-means seems reasonable for your data (only 4 components). The hard part is the initialization and the choice of the number of clusters. You can try several random initializations, but this can take a long time. An alternative is to subsample your data and run a more expensive clustering algorithm such as Affinity Propagation on the subsample, then use that solution as the initialization for k-means and run it on all of your data.
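A rough sketch of that idea with scikit-learn; the random stand-in data, the subsample size of 2000, and the choice of defaults for AffinityPropagation are illustrative only, and for a real billion-row dataset you would stream or memory-map the array rather than build it in memory:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

# Stand-in for the real dataset (the answer assumes ~4 components per vector).
X = np.random.rand(1_000_000, 4)

# 1. Subsample so that the expensive algorithm stays tractable.
rng = np.random.default_rng(0)
sample = X[rng.choice(len(X), size=2000, replace=False)]

# 2. Run Affinity Propagation on the subsample to discover cluster centers
#    (and, implicitly, the number of clusters).
ap = AffinityPropagation(random_state=0).fit(sample)
centers = ap.cluster_centers_

# 3. Use those centers to initialize k-means on the full data.
km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
labels = km.labels_
```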

+3

For a billion feature vectors, I would hesitate to use k-means on its own. I am sure you could, but it would take a long time and would therefore be hard to debug. I recommend running canopy clustering first and then applying k-means to reduce complexity and computation. These subclusters can then be processed with a MapReduce implementation to get an answer even faster.
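For reference, canopy clustering is not in scikit-learn, so here is a minimal sketch of the canopy-centers step; the function name `canopy_centers`, the Euclidean distance, and the thresholds `t1 > t2` are my own illustrative choices, and the resulting centers could then seed k-means (or be fed to a MapReduce job) as the answer suggests:

```python
import numpy as np

def canopy_centers(X, t1=1.0, t2=0.5):
    """Pick canopy centers with a cheap distance metric; requires t1 > t2."""
    remaining = list(range(len(X)))
    centers = []
    while remaining:
        idx = remaining.pop(0)        # take an arbitrary remaining point
        center = X[idx]
        centers.append(center)
        dists = np.linalg.norm(X[remaining] - center, axis=1)
        # Points within the tight threshold t2 belong firmly to this canopy,
        # so remove them from further consideration.
        remaining = [i for i, d in zip(remaining, dists) if d > t2]
    return np.array(centers)

# Example usage on stand-in data:
# centers = canopy_centers(np.random.rand(100_000, 4), t1=0.4, t2=0.2)
# KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
```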

+3
