Working with large data sets in mclust is described in the Technical Report, subsection 11.1.
In short, the Mclust and mclustBIC functions include a provision for using a subsample of the data in the hierarchical clustering phase, before EM is applied to the complete dataset; this extends the method to larger datasets.
General example:
    library(mclust)
    set.seed(1)

    ##
    ## Data generation
    ##
    N <- 5e3
    df <- data.frame(x=rnorm(N)+ifelse(runif(N)>0.5,5,0), y=rnorm(N,10,5))

    ##
    ## Full set
    ##
    system.time(res <- Mclust(df))
    # >    user  system elapsed
    # >  66.432   0.124  67.439

    ##
    ## Subset for initial stage
    ##
    M <- 1e3
    system.time(res <- Mclust(df, initialization=list(subset=sample(1:nrow(df), size=M))))
    # >    user  system elapsed
    # >  19.513   0.020  19.546
The subset version runs about 3.5 times faster on my dual-core processor (mclust uses only a single core).
With N <- 5e4 (as in your example) and M <- 1e3, the subset version took about 3.5 minutes.
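The same initialization argument can also be given to mclustBIC directly. The following is only a minimal sketch of that route, not a benchmark: the data set `dat`, the sizes `N` and `M`, and the index vector `idx` are made up for illustration, and the timings above do not apply to it.

```r
library(mclust)
set.seed(1)

# Small synthetic data set (illustrative only; sizes are arbitrary assumptions)
N <- 2000
dat <- data.frame(x = rnorm(N) + ifelse(runif(N) > 0.5, 5, 0),
                  y = rnorm(N, 10, 5))

# Rows used only for the hierarchical initialization stage
M <- 500
idx <- sample(nrow(dat), size = M)

# BIC over the default candidate models, initialized from the subset
bic <- mclustBIC(dat, initialization = list(subset = idx))

# Reuse the precomputed BIC values instead of recomputing them inside Mclust
fit <- Mclust(dat, x = bic)
summary(fit)
```

Keeping `idx` in a variable makes the run reproducible and lets you compare different subset sizes on the same data.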