How to run mclust faster on a dataset of 50,000 records

I'm a beginner. I'm trying to cluster a data frame (with 50,000 records) that has 2 features (x, y) using mclust. However, it takes a very long time to run the command (for example, Mclust(XXX.df) or densityMclust(XXX.df)).

Is there a way to make these calls run faster? Sample code would be helpful.

For your information, I'm using a machine with a 4-core processor and 6 GB of RAM. The same analysis (clustering) took about 15 minutes in Weka, while in R the process has been running for more than 1.5 hours. I really want to use R for this analysis.

1 answer

Working with large datasets in mclust is described in the mclust Technical Report, subsection 11.1.

In short, the functions Mclust and mclustBIC include a provision for subsampling the data in the hierarchical initialization phase of clustering, before applying EM to the full dataset. This extends the method to larger datasets.

General example:

 library(mclust)
 set.seed(1)

 ##
 ## Data generation
 ##
 N <- 5e3
 df <- data.frame(x = rnorm(N) + ifelse(runif(N) > 0.5, 5, 0),
                  y = rnorm(N, 10, 5))

 ##
 ## Full dataset
 ##
 system.time(res <- Mclust(df))
 # >   user  system elapsed
 # > 66.432   0.124  67.439

 ##
 ## Subset for the initialization stage
 ##
 M <- 1e3
 system.time(res <- Mclust(df,
     initialization = list(subset = sample(1:nrow(df), size = M))))
 # >   user  system elapsed
 # > 19.513   0.020  19.546

The subsetted version runs about 3.5 times faster on my dual-core machine (although mclust only uses a single core).

With N <- 5e4 (as in your example) and M <- 1e3, the subsetted version took about 3.5 minutes.
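Since you also mentioned densityMclust, the same subsampling trick should carry over: densityMclust forwards extra arguments to the underlying model fit, so passing the same initialization list ought to work. A hedged sketch (XXX.df is a simulated stand-in for your data; a smaller N is used here to keep the run short, but the call is the same at 50,000 records):

```r
library(mclust)
set.seed(1)

## Simulated stand-in for the asker's two-feature data frame
## (substitute your own XXX.df; scale N up to 5e4 for the real case)
N <- 1e4
XXX.df <- data.frame(x = rnorm(N) + ifelse(runif(N) > 0.5, 5, 0),
                     y = rnorm(N, 10, 5))

## Subsample only the hierarchical initialization stage;
## EM is still run on the full dataset
M <- 1e3
dens <- densityMclust(XXX.df,
                      initialization = list(subset = sample(nrow(XXX.df), M)))

## dens$density holds the estimated density at each observation
str(dens$density)
```

The choice of M is a speed/quality trade-off: a larger subsample gives a better starting partition for EM at the cost of a slower initialization.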

