Clustering a billion items (or: which clustering methods run in linear time?)

I have a billion feature vectors that I would like to group into approximate clusters. Looking at the methods at http://scikit-learn.org/stable/modules/clustering.html#clustering , for example, it is not entirely clear to me how their runtime scales with the size of the data (except for Affinity Propagation, which is clearly too slow).

What methods are suitable for clustering such a large dataset? I assume any suitable method would have to run in O(n) time.

+7
python machine-learning
2 answers

The complexity of k-means seems reasonable for your data (only 4 components). The hard part is the initialization and the choice of the number of clusters. You can try several random initializations, but this can take a long time. An alternative is to subsample your data and run a more expensive clustering algorithm such as Affinity Propagation on the subsample, then use that solution as the initialization for k-means and run it on all of your data.
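A rough sketch of that idea with scikit-learn; the random stand-in data, the subsample size of 2000, and the choice of defaults for AffinityPropagation are illustrative only, and for a real billion-row dataset you would stream or memory-map the array rather than build it in memory:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

# Stand-in for the real dataset (the answer assumes ~4 components per vector).
X = np.random.rand(1_000_000, 4)

# 1. Subsample so that the expensive algorithm stays tractable.
rng = np.random.default_rng(0)
sample = X[rng.choice(len(X), size=2000, replace=False)]

# 2. Run Affinity Propagation on the subsample to discover cluster centers
#    (and, implicitly, the number of clusters).
ap = AffinityPropagation(random_state=0).fit(sample)
centers = ap.cluster_centers_

# 3. Use those centers to initialize k-means on the full data.
km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
labels = km.labels_
```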

+3

For a billion feature vectors, I would hesitate to use k-means on its own. I am sure you could, but it would take a long time and would therefore be hard to debug. I recommend running canopy clustering first and then applying k-means to reduce complexity and computation. These subclusters can then be processed with a MapReduce implementation to get an answer even faster.
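For reference, canopy clustering is not in scikit-learn, so here is a minimal sketch of the canopy-centers step; the function name `canopy_centers`, the Euclidean distance, and the thresholds `t1 > t2` are my own illustrative choices, and the resulting centers could then seed k-means (or be fed to a MapReduce job) as the answer suggests:

```python
import numpy as np

def canopy_centers(X, t1=1.0, t2=0.5):
    """Pick canopy centers with a cheap distance metric; requires t1 > t2."""
    remaining = list(range(len(X)))
    centers = []
    while remaining:
        idx = remaining.pop(0)        # take an arbitrary remaining point
        center = X[idx]
        centers.append(center)
        dists = np.linalg.norm(X[remaining] - center, axis=1)
        # Points within the tight threshold t2 belong firmly to this canopy,
        # so remove them from further consideration.
        remaining = [i for i, d in zip(remaining, dists) if d > t2]
    return np.array(centers)

# Example usage on stand-in data:
# centers = canopy_centers(np.random.rand(100_000, 4), t1=0.4, t2=0.2)
# KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
```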

+3
