It depends on what you mean when talking about an RDD. Strictly speaking, an RDD is just a description of a lineage that exists only on the driver, and it does not provide any methods that can be used to mutate that lineage.
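For instance, a transformation only records a new step in that lineage; nothing runs until an action is invoked. A minimal sketch, assuming a Spark shell where `sc` is available:

```scala
// Transformations only extend the lineage description on the driver;
// no job is executed until an action is called.
val numbers = sc.parallelize(1 to 4)
val doubled = numbers.map(_ * 2) // nothing runs yet, only lineage is recorded
println(doubled.toDebugString)   // prints the lineage, e.g. MapPartitionsRDD <- ParallelCollectionRDD
```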
When data is processed we can no longer talk about RDDs but about tasks; nevertheless, the data is exposed through immutable data structures (scala.collection.Iterator in Scala, itertools.chain in Python).
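As a small sketch of that (assuming the same `sc`), `mapPartitions` hands each task its partition's data as a plain `scala.collection.Iterator`, which can only be consumed, not updated in place:

```scala
// Each task sees its partition as an immutable Iterator;
// the only way to produce new data is to build a new Iterator.
val parts = sc.parallelize(1 to 6, 2)
parts.mapPartitions { (iter: Iterator[Int]) =>
  Iterator.single(iter.sum) // e.g. sum each partition
}.collect() // Array(6, 15)
```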
So far so good. Unfortunately, the immutability of the data structure does not imply the immutability of the stored data. Let's create a small example to illustrate this:
```scala
val rdd = sc.parallelize(Array(0) :: Array(0) :: Array(0) :: Nil)
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0
```
You can run this as many times as you want and get the same result. Now let's cache the RDD and repeat the whole process:
```scala
rdd.cache
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 6.0
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 9.0
```
Since the function we use in the first map is not pure and modifies its mutable argument in place, these changes accumulate with each execution and lead to unpredictable results. For example, if the RDD is evicted from the cache, we can get 3.0 again. If only some partitions are cached, you can get mixed results.
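Continuing the snippet above, forcing eviction with `unpersist` (a sketch) makes the next run recompute from the original data:

```scala
// After eviction the partitions are recomputed from scratch,
// so the increment is applied to fresh arrays again.
rdd.unpersist()
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0
```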
PySpark provides stronger isolation, so obtaining a result like this is not possible, but that is a matter of architecture, not of immutability.
The takeaway message here is that you should be extremely careful when working with mutable data and avoid any modifications unless it is explicitly allowed (fold, aggregate).
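As an illustration of the allowed case, `aggregate` explicitly permits both of its operators to modify and return their first argument, so a sketch like this is safe:

```scala
import scala.collection.mutable.ArrayBuffer

// aggregate's contract allows both operators to mutate the
// accumulator in place, so modification here is explicitly permitted.
val nums = sc.parallelize(1 to 5)
val collected = nums.aggregate(ArrayBuffer.empty[Int])(
  (acc, x) => { acc += x; acc }, // mutate the per-partition accumulator
  (a, b)   => { a ++= b; a }     // merge partition results in place
)
```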