What parallel algorithms exist in R that work on big data?

I am trying to find out which statistical / machine-learning algorithms in R, or R packages on CRAN / GitHub / R-Forge, can process large data sets: either in parallel on a single server, or sequentially in chunks to avoid memory problems, or distributed across several machines simultaneously. The goal is to evaluate whether I can easily port them to work with ff / ffbase, for example ffbase::bigglm.ffdf.

I would like to break them into 3 parts:

1. algorithms that run in parallel on a single server (multicore),
2. algorithms that process the data sequentially in chunks, so the full data set never has to fit in memory,
3. algorithms that run distributed across several machines simultaneously.

And I would like to exclude trivially parallel tasks such as hyperparameter optimization or cross-validation. Any other pointers to such models / optimizers or algorithms? Maybe Bayesian methods? Perhaps a package such as RGraphlab (http://graphlab.org/)?
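To make the ff / ffbase target concrete, here is a minimal sketch of the chunk-wise (out-of-memory) case using ffbase::bigglm.ffdf; the toy data and column names x / y are made up for illustration:

library(ff)
library(ffbase)
library(biglm)
# a small on-disk ffdf standing in for a data set too big for RAM
dat <- as.ffdf(data.frame(x = rnorm(1000),
                          y = rbinom(1000, 1, 0.5)))
# bigglm reads the data in chunks, so the full table
# never has to fit in memory at once
fit <- bigglm(y ~ x, data = dat, family = binomial(), chunksize = 250)
summary(fit)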

2 answers

Random forests are trivially parallelizable. Here is one example from the foreach vignette:

x <- matrix(runif(500), 100)   # 100 x 5 predictor matrix
y <- gl(2, 50)                 # two-level factor response
library(randomForest)
library(foreach)
# four workers each grow 250 trees; randomForest::combine
# merges the pieces into a single 1000-tree forest
rf <- foreach(ntree = rep(250, 4), .combine = combine,
              .packages = 'randomForest') %dopar%
  randomForest(x, y, ntree = ntree)

You can use this construct to split building the forest across the cores in your cluster.
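Note that %dopar% needs a parallel backend registered first. A minimal sketch using doParallel (assuming 4 local cores; adjust to your machine):

library(doParallel)
cl <- makeCluster(4)      # or parallel::detectCores()
registerDoParallel(cl)    # %dopar% now dispatches to the 4 workers
# ... run the foreach() call above ...
stopCluster(cl)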


Have you read the High-Performance Computing Task View on CRAN (https://cran.r-project.org/view=HighPerformanceComputing)?

It covers many of the points you mentioned and provides an overview of packages in these areas.
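If it helps, the ctv package can install everything a task view lists in one call (a sketch; note this pulls in a lot of packages):

install.packages("ctv")
library(ctv)
install.views("HighPerformanceComputing")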

