It depends on what you mean when talking about an RDD. Strictly speaking, an RDD is just a description of a lineage that exists only on the driver, and it does not provide any methods that can be used to mutate that lineage.
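For instance, a transformation only records a new step in that lineage; nothing runs until an action is invoked. A minimal sketch, assuming a Spark shell where `sc` is available:

```scala
// Transformations only extend the lineage description on the driver;
// no job is executed until an action is called.
val numbers = sc.parallelize(1 to 4)
val doubled = numbers.map(_ * 2) // nothing runs yet, only lineage is recorded
println(doubled.toDebugString)   // prints the lineage, e.g. MapPartitionsRDD <- ParallelCollectionRDD
```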
When data is processed we can no longer talk about RDDs but about tasks; nevertheless, the data is exposed through immutable data structures (scala.collection.Iterator in Scala, itertools.chain in Python).
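As a small sketch of that (assuming the same `sc`), `mapPartitions` hands each task its partition's data as a plain `scala.collection.Iterator`, which can only be consumed, not updated in place:

```scala
// Each task sees its partition as an immutable Iterator;
// the only way to produce new data is to build a new Iterator.
val parts = sc.parallelize(1 to 6, 2)
parts.mapPartitions { (iter: Iterator[Int]) =>
  Iterator.single(iter.sum) // e.g. sum each partition
}.collect() // Array(6, 15)
```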
So far so good. Unfortunately, the immutability of the data structure does not imply the immutability of the stored data. Let's create a small example to illustrate this:
```scala
val rdd = sc.parallelize(Array(0) :: Array(0) :: Array(0) :: Nil)
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0
```
You can run this as many times as you want and get the same result. Now let's cache the RDD and repeat the whole process:
```scala
rdd.cache
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 6.0
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 9.0
```
Since the function we use in the first map is not pure and modifies its mutable argument in place, these changes accumulate with each execution and lead to unpredictable results. For example, if the RDD is evicted from the cache, we can get 3.0 again. If only some partitions are cached, you can get mixed results.
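Continuing the snippet above, forcing eviction with `unpersist` (a sketch) makes the next run recompute from the original data:

```scala
// After eviction the partitions are recomputed from scratch,
// so the increment is applied to fresh arrays again.
rdd.unpersist()
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0
```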
PySpark provides stronger isolation, so obtaining a result like this is not possible, but that is a matter of architecture, not of immutability.
The takeaway message here is that you should be extremely careful when working with mutable data and avoid any modifications unless it is explicitly allowed (fold, aggregate).
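As an illustration of the allowed case, `aggregate` explicitly permits both of its operators to modify and return their first argument, so a sketch like this is safe:

```scala
import scala.collection.mutable.ArrayBuffer

// aggregate's contract allows both operators to mutate the
// accumulator in place, so modification here is explicitly permitted.
val nums = sc.parallelize(1 to 5)
val collected = nums.aggregate(ArrayBuffer.empty[Int])(
  (acc, x) => { acc += x; acc }, // mutate the per-partition accumulator
  (a, b)   => { a ++= b; a }     // merge partition results in place
)
```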