The main language of the project seems to be an important factor.
If pyspark works well for you (that is, you are accessing Spark from Python), then accessing R through rpy2 should not be much different from using any other Python library with a C extension.
There are reports of users doing this (along with occasional questions, such as "How can I split the pyspark RDD where the R functions are stored" or "Can I connect an external (R) process to every pyspark worker during installation").
If R is your main language, SparkR is the way to go; where you feel it has limitations, you could help its authors with feedback or contributions.
If your main language is Scala, rscala should be your first try.
While pyspark + rpy2 may seem the most "established" (as in "uses the oldest and probably most exercised code base"), that does not necessarily make it the best solution (and young packages can evolve quickly). I would first work out which language is preferred for the project, and try the options from there.
lgautier