If one RDD (say rdd1) is joined against all the others, the idea is to partition that RDD once and cache it, so its partitioning is reused in every join.
Here is an implementation in pseudo-Scala (it can easily be converted to Python or Java):
val rdd1 = sc.textFile(file1).map(..).partitionBy(new HashPartitioner(200)).cache()
This hash-partitions rdd1 into 200 partitions. The first time rdd1 is evaluated, the partitioned result is cached, so subsequent joins reuse it.
Now let's read two more RDDs and join each of them with rdd1:
val rdd2 = sc.textFile(file2).map(..)
val join1 = rdd1.join(rdd2).map(peopleObject)
val rdd3 = sc.textFile(file3).map(..)
val join2 = rdd1.join(rdd3).map(peopleObject)
Note that we neither partition nor cache the other RDDs (rdd2 and rdd3).
Spark will see that rdd1 is already hash-partitioned and will reuse those partitions for every join against it. Thus, rdd2 and rdd3 will shuffle their keys to the same locations where rdd1's matching keys already sit, while rdd1 itself is not shuffled again.
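Why do the keys from rdd2 and rdd3 land exactly where rdd1's keys are? Because hash partitioning is deterministic: Spark's HashPartitioner assigns a key to a partition using a non-negative modulo of its hashCode. A minimal sketch (no Spark required; the class and method names here are illustrative, not Spark API):

```java
// Sketch of how HashPartitioner-style assignment works: the same key
// always maps to the same partition index, regardless of which RDD it
// came from. That is why a join against an already-partitioned rdd1
// only needs to move the other RDD's records.
public class CoPartitionDemo {
    static int partitionIndex(Object key, int numPartitions) {
        int raw = key.hashCode() % numPartitions;
        // Java's % can be negative; fold back into [0, numPartitions).
        return raw < 0 ? raw + numPartitions : raw;
    }

    public static void main(String[] args) {
        int numPartitions = 200;
        for (String key : new String[] {"alice", "bob", "carol"}) {
            System.out.println(key + " -> partition "
                    + partitionIndex(key, numPartitions));
        }
    }
}
```

Whether a key arrives via rdd2 or rdd3, it maps to the same index, so the join tasks run on the nodes that already hold the matching cached rdd1 partitions.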
To make this clearer, suppose we skip the partitioning and use the same code as in the question: each time we perform a join, both RDDs are shuffled. So if we join N RDDs with rdd1, the non-partitioned version shuffles rdd1 N times, while the partitioned approach shuffles rdd1 only once.