Spark / Scala: passing RDD to a function

I'm curious what exactly happens when an RDD is passed to a function in Spark.

def my_func(x: RDD[String]): RDD[String] = { do_something_here }

Suppose we define a function as described above. When we call it and pass an existing RDD[String] object as an argument, does my_func receive a "copy" of that RDD? In other words, is the RDD passed by reference or by value?

+7
scala apache-spark rdd
2 answers

Nothing is copied in Scala (in the pass-by-value sense you have in C/C++) when you pass an argument; only the reference is copied. Most of the basic types (Int, String, Double, etc.) are immutable, so sharing them by reference is perfectly safe. (Note: if you pass a mutable object and modify it, everyone holding a reference to that object will see the change.)
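That note about mutable objects can be demonstrated in plain Scala, no Spark needed: the callee receives a copy of the reference, not of the object, so mutating a mutable argument is visible to the caller (the names below are made up for illustration):

```scala
import scala.collection.mutable.ArrayBuffer

// The function gets a copy of the *reference* to the buffer, not a copy
// of the buffer itself, so this mutation is visible to the caller.
def appendItem(buf: ArrayBuffer[String]): Unit = {
  buf += "added inside function"
}

val data = ArrayBuffer("original")
appendItem(data)
println(data.mkString(", ")) // original, added inside function
```

This is exactly why immutable values (and immutable RDDs) are safe to pass around freely.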

In addition, RDDs are lazy, distributed, immutable collections. Passing an RDD to a function and applying transformations to it (map, filter, etc.) does not move any data and does not trigger any computation.

All chained transformations are "remembered" and executed in the correct order only when you force execution by performing an action on the RDD, for example saving it or collecting it to the driver (via collect() , take(n) , etc.)
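Since Spark itself isn't assumed to be available here, a minimal plain-Scala sketch with a lazy Iterator shows the same "remember now, compute on action" behavior the answer describes (this is an analogy for RDD laziness, not Spark's API):

```scala
// Iterator.map records the transformation but evaluates nothing
// until a terminal operation runs -- just like an RDD transformation.
var evaluations = 0

val source = Iterator("a", "bb", "ccc")
val transformed = source.map { s => evaluations += 1; s.length } // "transformation"

assert(evaluations == 0)          // nothing computed yet
val result = transformed.toList   // "action": forces evaluation
assert(evaluations == 3)
println(result)                   // List(1, 2, 3)
```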

+12

Spark follows the principle of "send the code to the data" rather than sending the data to the code. So exactly the opposite happens: it is your function that gets serialized, distributed, and sent to the nodes holding the RDD's partitions.

RDDs are immutable, so your function will either produce a new RDD as its result (a transformation) or compute some plain value (an action).
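A hypothetical toy (MiniRDD is invented here for illustration; it is not Spark's API) shows the shape of that distinction: transformations return a new immutable dataset with a bigger recorded plan, while actions force the plan and return a value.

```scala
// Toy stand-in for an RDD: an immutable wrapper around a deferred plan.
final case class MiniRDD[A](plan: () => Seq[A]) {
  // "Transformation": builds a new MiniRDD, computes nothing yet.
  def map[B](f: A => B): MiniRDD[B] = MiniRDD(() => plan().map(f))
  // "Actions": run the recorded plan and return plain values.
  def collect(): Seq[A] = plan()
  def count(): Long = plan().size.toLong
}

val rdd = MiniRDD(() => Seq("a", "bb", "ccc"))
val lengths = rdd.map(_.length)   // transformation: new MiniRDD, no work done
println(lengths.collect())        // action: List(1, 2, 3)
println(lengths.count())          // action: 3
```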

An interesting question here: when you define a function, what exactly gets shipped to the RDD's nodes (and distributed across the cluster, with the associated transfer cost)? There is a good explanation here:

http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark
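The gist of that link: when you call a transformation, Spark serializes the closure you passed and ships the bytes to the executors. A rough sketch of just the serialization round-trip, using plain JVM serialization with no Spark involved (the function name is made up for illustration):

```scala
import java.io._

// Scala function literals compile to Serializable classes, which is
// what lets Spark ship them over the wire to executor JVMs.
val addExclamation: String => String = _ + "!"

val buffer = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buffer)
out.writeObject(addExclamation) // serialize the closure to bytes
out.close()

val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
val revived = in.readObject().asInstanceOf[String => String] // "executor" side
println(revived("hello")) // hello!
```

This is also why a closure that accidentally captures a non-serializable object (or a whole enclosing class) fails with a NotSerializableException in real Spark jobs.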

+4
