Using R in Apache Spark

There are several options for accessing R libraries from Spark:

  • using SparkR directly
  • using language bindings such as rpy2 or rscala
  • using a standalone service such as OpenCPU

It seems that SparkR is quite limited, OpenCPU requires maintaining an additional service, and bindings can have stability problems. Is there anything about Spark's architecture that makes any of these solutions difficult to use?

Do you have any experience integrating R and Spark that you can share?

Tags: r, distributed-computing, apache-spark, rpy2, opencpu
1 answer

The main language of the project seems to be an important factor.

If pyspark is a good way for you to use Spark (that is, you are accessing Spark from Python), then accessing R through rpy2 should not be much different from using any other Python library with a C extension.
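
For illustration, a minimal sketch of what calling R through rpy2 looks like from plain Python (the R function and values are just placeholders; it assumes R and the rpy2 package are installed):

    # Minimal rpy2 usage from plain Python; assumes R and rpy2 are installed.
    import rpy2.robjects as ro

    # Copy a Python list into R as a numeric vector and call R's mean() on it.
    values = ro.FloatVector([1.0, 2.5, 4.0])
    r_mean = ro.r["mean"]
    print(r_mean(values)[0])  # 2.5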

There are reports of users doing this (though with the occasional question, such as how to partition a pyspark RDD that holds R functions, or whether an external R process can be attached to each pyspark worker at setup time); a sketch of the basic pattern follows below.
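
As a rough sketch only (assuming rpy2 and R are installed on every worker node, and using the RDD mapPartitions API), the per-partition pattern those questions describe looks roughly like this:

    # Sketch: calling R via rpy2 from PySpark tasks; assumes rpy2 and R are
    # installed on every worker node.
    from pyspark import SparkContext

    def r_median_per_partition(rows):
        # Import rpy2 inside the task so the embedded R interpreter starts in
        # each Python worker process rather than on the driver.
        import rpy2.robjects as ro
        values = ro.FloatVector(list(rows))
        # One call into R per partition: compute the partition's median.
        yield ro.r["median"](values)[0]

    sc = SparkContext(appName="rpy2-in-pyspark")
    rdd = sc.parallelize([float(x) for x in range(100)], 4)
    print(rdd.mapPartitions(r_median_per_partition).collect())
    sc.stop()

Note that rpy2 objects do not serialize well across processes, so it is safer to keep R objects inside the task and return plain Python values, as above.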

If R is your main language, helping the SparkR authors with feedback or contributions where you feel there are limitations could be the way to go.

If your main language is Scala, rscala should be your first try.

While pyspark + rpy2 would seem the most "established" combination (as in "uses the oldest and probably most widely exercised code base"), that does not necessarily make it the best solution (and young packages can evolve quickly). I would first assess what the preferred language for the project is, and try the options from there.

