When I execute the command below:
scala> import org.apache.spark.HashPartitioner
import org.apache.spark.HashPartitioner

scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10)).persist()
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[10] at partitionBy at <console>:22

scala> rdd.partitions.size
res9: Int = 10

scala> rdd.partitioner.isDefined
res10: Boolean = true

scala> rdd.partitioner.get
res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner@a
It states that there are 10 partitions and that the partitioning is done with HashPartitioner. But when I execute the command below:
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4)
...

scala> rdd.partitions.size
res6: Int = 4

scala> rdd.partitioner.isDefined
res8: Boolean = false
It states that there are 4 partitions and that no partitioner is defined. So, what is the default partitioning scheme in Spark? How is the data distributed across partitions in the second case?
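For reference, a minimal sketch (assuming the same spark-shell session) of how one can inspect which elements end up in which partition; glom() is a standard RDD method that groups each partition's elements into an array:

// Build the same RDD as in the second case: 4 partitions, no partitioner.
val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)), 4)

// glom() turns each partition into an array of its elements,
// so collect() yields one array per partition.
rdd.glom().collect().zipWithIndex.foreach { case (elems, i) =>
  println(s"partition $i: " + elems.mkString(", "))
}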
partitioning apache-spark rdd
Dinesh Sachdev, Dec 28 '15 at 9:53