Sorting in Spark is a multiphase process that requires a shuffle:

- a sample of the input RDD is used to compute the boundaries of each output partition (`sample` followed by `collect`)
- the input RDD is partitioned using a `RangePartitioner` with the boundaries computed in the first step (`partitionBy`)
- each partition from the second step is sorted locally (`mapPartitions`)
When the data is collected afterwards, all that remains is to follow the order defined by the partitioner: partition *i* only holds values smaller than those of partition *i + 1*, so concatenating the partitions in order yields a globally sorted result.
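For illustration, here is a minimal sketch of the same three phases done by hand in a `spark-shell` session (so `sc` is the usual `SparkContext`); the partition count and the local sort are chosen for simplicity and are not meant to mirror `sortBy`'s internals exactly:

```scala
import org.apache.spark.RangePartitioner

val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))

// RangePartitioner works on key-value pairs, so key every element by itself
val keyed = rdd.keyBy(identity)

// Steps 1 and 2: the RangePartitioner samples `keyed` internally to compute
// range boundaries; partitionBy then shuffles each record into its range
val ranged = keyed.partitionBy(new RangePartitioner(2, keyed))

// Step 3: sort each partition locally; partitions already respect the
// range order, so reading them in order gives a total order
val sorted = ranged.mapPartitions(_.toSeq.sortBy(_._1).iterator,
                                  preservesPartitioning = true)

sorted.values.collect()  // Array(1, 2, 3, 4, 5)
```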
The above steps are clearly reflected in the debug string:
```scala
scala> val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at ...

scala> rdd.sortBy(identity).toDebugString
res1: String =
(6) MapPartitionsRDD[10] at sortBy at <console>:24 []          // Sort partitions
 |  ShuffledRDD[9] at sortBy at <console>:24 []                // Shuffle
 +-(8) MapPartitionsRDD[6] at sortBy at <console>:24 []        // Pre-shuffle steps
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 []  // Parallelize
```
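As a hypothetical follow-up, `glom` can be used to inspect how the sorted output is laid out across partitions; the exact split shown here is illustrative, since it depends on the boundaries computed from the sample:

```scala
scala> rdd.sortBy(identity, numPartitions = 2).glom().collect()
res2: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5))  // split may vary
```

Each inner array is one sorted partition, and every element of a partition is less than or equal to every element of the next one, which is exactly why following the partitioner's order gives a total order.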