How does Spark achieve sort order?

Suppose I have a list of strings. I filter and sort them, then collect the result to the driver. However, everything is distributed, and each RDD holds its own part of the source list. So how does Spark produce the final, globally sorted result? Does it merge the partial results?
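For concreteness, a pipeline like this (the data and names are just illustrative):

    val strings = sc.parallelize(Seq("banana", "apple", "cherry", "avocado"))
    val result = strings.filter(_.startsWith("a")).sortBy(identity).collect()
    // result is ordered on the driver, even though the data lived on several partitions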


Sorting in Spark is a multi-phase process that requires a shuffle:

  1. the input RDD is sampled, and this sample is used to compute the boundaries of each output partition ( sample followed by collect )
  2. the input RDD is partitioned with a RangePartitioner using the boundaries computed in the first step ( partitionBy )
  3. each partition produced in the second step is sorted locally ( mapPartitions )

When the data is collected, all that remains is to concatenate the partitions in the order defined by the partitioner.
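To make the phases concrete, here is a rough hand-rolled equivalent using the public RangePartitioner API. It is only a sketch of the idea, not Spark's actual internal implementation ( sortBy / sortByKey do this more efficiently):

    import org.apache.spark.RangePartitioner

    // partitioners work on (K, V) pairs, so key the values (the value part is a dummy)
    val pairs = sc.parallelize(Seq(4, 2, 5, 3, 1)).map(x => (x, 1))

    // 1. RangePartitioner samples the RDD and computes partition boundaries
    val partitioner = new RangePartitioner(4, pairs)

    // 2. shuffle each record to the partition whose key range contains it
    val partitioned = pairs.partitionBy(partitioner)

    // 3. sort each partition locally; partitions are already ordered relative to each other
    val sorted = partitioned.mapPartitions(
      iter => iter.toSeq.sortBy(_._1).iterator,
      preservesPartitioning = true)

    sorted.keys.collect()  // Array(1, 2, 3, 4, 5)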

The steps above are clearly reflected in the debug string:

    scala> val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at ...

    scala> rdd.sortBy(identity).toDebugString
    res1: String =
    (6) MapPartitionsRDD[10] at sortBy at <console>:24 []      // Sort partitions
     |  ShuffledRDD[9] at sortBy at <console>:24 []            // Shuffle
     +-(8) MapPartitionsRDD[6] at sortBy at <console>:24 []    // Pre-shuffle steps
        |  ParallelCollectionRDD[0] at parallelize at <console>:21 []  // Parallelize
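You can also inspect the per-partition layout directly with glom , which turns each partition into an array. Each partition holds a contiguous, locally sorted key range, and collect simply concatenates them in partition order (the exact boundaries depend on the sample and the number of partitions):

    scala> rdd.sortBy(identity, numPartitions = 2).glom().collect()
    // e.g. Array(Array(1, 2), Array(3, 4, 5)) -- exact split depends on the sample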