What parallel algorithms exist in R that work on big data?

I am trying to find out which statistical / machine-learning algorithms in R, or R packages on CRAN / GitHub / R-Forge, can process large data sets: either in parallel on a single server, or sequentially in chunks to avoid memory problems, or distributed across several machines simultaneously. The goal is to evaluate whether I can easily port them to work with ff / ffbase, for example ffbase::bigglm.ffdf.

I would like to break them into 3 parts:

1. algorithms that run in parallel on a single server (multicore),
2. algorithms that process the data sequentially in chunks, so the full data set never has to fit in memory,
3. algorithms that run distributed across several machines simultaneously.

And I would like to exclude trivially parallel tasks such as hyperparameter optimization or cross-validation. Any other pointers to such models / optimizers or algorithms? Maybe Bayesian methods? Perhaps a package such as RGraphlab (http://graphlab.org/)?
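To make the ff / ffbase target concrete, here is a minimal sketch of the chunk-wise (out-of-memory) case using ffbase::bigglm.ffdf; the toy data and column names x / y are made up for illustration:

library(ff)
library(ffbase)
library(biglm)
# a small on-disk ffdf standing in for a data set too big for RAM
dat <- as.ffdf(data.frame(x = rnorm(1000),
                          y = rbinom(1000, 1, 0.5)))
# bigglm reads the data in chunks, so the full table
# never has to fit in memory at once
fit <- bigglm(y ~ x, data = dat, family = binomial(), chunksize = 250)
summary(fit)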

2 answers

Random forests are trivially parallelizable. Here is one example from the foreach vignette:

x <- matrix(runif(500), 100)   # 100 x 5 predictor matrix
y <- gl(2, 50)                 # two-level factor response
library(randomForest)
library(foreach)
# four workers each grow 250 trees; randomForest::combine
# merges the pieces into a single 1000-tree forest
rf <- foreach(ntree = rep(250, 4), .combine = combine,
              .packages = 'randomForest') %dopar%
  randomForest(x, y, ntree = ntree)

You can use this construct to split building the forest across the cores in your cluster.
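Note that %dopar% needs a parallel backend registered first. A minimal sketch using doParallel (assuming 4 local cores; adjust to your machine):

library(doParallel)
cl <- makeCluster(4)      # or parallel::detectCores()
registerDoParallel(cl)    # %dopar% now dispatches to the 4 workers
# ... run the foreach() call above ...
stopCluster(cl)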


Have you read the High-Performance Computing Task View on CRAN (https://cran.r-project.org/view=HighPerformanceComputing)?

It covers many of the points you mentioned and provides an overview of packages in these areas.
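If it helps, the ctv package can install everything a task view lists in one call (a sketch; note this pulls in a lot of packages):

install.packages("ctv")
library(ctv)
install.views("HighPerformanceComputing")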

