The default partitioning scheme in Spark

When I execute the command below:

    scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10)).persist()
    rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[10] at partitionBy at <console>:22

    scala> rdd.partitions.size
    res9: Int = 10

    scala> rdd.partitioner.isDefined
    res10: Boolean = true

    scala> rdd.partitioner.get
    res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner@a

It says there are 10 partitions and that the partitioning is done with a HashPartitioner. But when I execute the command below:

    scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4)
    ...

    scala> rdd.partitions.size
    res6: Int = 4

    scala> rdd.partitioner.isDefined
    res8: Boolean = false

It says there are 4 partitions and that the partitioner is not defined. So what is the default partitioning scheme in Spark, and how is the data distributed across partitions in the second case?

+10
partitioning apache-spark rdd
Dec 28 '15 at 9:53
1 answer

You must distinguish between two different things:

  • partitioning as the distribution of data between partitions depending on the value of the key. This is available only for PairwiseRDDs ( RDD[(T, U)] ) and creates a relationship between a partition and the set of keys that can be found on it.
  • partitioning as splitting the input into multiple partitions, where the data is simply divided into chunks of consecutive records to enable distributed computation. The exact logic depends on the source, but it is driven by either the number of records or the size of a chunk.

    In the case of parallelize, the data is evenly distributed between partitions using their indices. In the case of HadoopInputFormats (e.g. textFile ), it depends on properties like mapreduce.input.fileinputformat.split.minsize / mapreduce.input.fileinputformat.split.maxsize . Both meanings are illustrated in the sketch below.
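A minimal REPL sketch of both meanings (the res numbers and exact console output are illustrative and may differ slightly between Spark versions):

    // Second meaning: parallelize slices the collection into even,
    // consecutive chunks by index; no key is involved.
    scala> sc.parallelize(1 to 8, 4).glom().collect()
    res0: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4), Array(5, 6), Array(7, 8))

    // First meaning: a Partitioner maps each key to a partition id.
    // HashPartitioner uses a non-negative key.hashCode modulo numPartitions.
    scala> import org.apache.spark.HashPartitioner
    scala> val p = new HashPartitioner(10)

    scala> p.getPartition(1)
    res1: Int = 1

    scala> p.getPartition(3)
    res2: Int = 3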

So the default partitioning scheme is simply none, because partitioning is not applicable to all RDDs. For operations that require partitioning on a PairwiseRDD ( aggregateByKey , reduceByKey , etc.), hash partitioning is used by default.
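You can see this default kick in with the same pairs as in the question. This is a sketch, not exact output: the res numbers are illustrative, and the resulting number of partitions follows Spark's default-partitioner logic (here it inherits the parent's 4 partitions):

    scala> val pairs = sc.parallelize(List((1,2),(3,4),(3,6)), 4)

    scala> pairs.partitioner.isDefined
    res0: Boolean = false

    // No partitioner passed explicitly, so Spark falls back to a HashPartitioner
    scala> val reduced = pairs.reduceByKey(_ + _)

    scala> reduced.partitioner.isDefined
    res1: Boolean = true

    scala> reduced.partitioner.get
    res2: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner@4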

+13
Dec 28 '15 at 10:19


