RDDs are distributed collections that materialize only on actions. It is not possible to truncate an RDD to a fixed size and get an RDD back (that is why RDD.take(n) returns Array[T], just like collect).
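A quick illustration from spark-shell (using the standard take API; the variable name is just for the example):

val firstFive: Array[Int] = sc.parallelize(1 to 100).take(5)
// firstFive: Array[Int] = Array(1, 2, 3, 4, 5) -- an Array on the driver, not an RDD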
If you want an RDD of the same size regardless of the input size, you can instead take a fixed number of elements from each of your partitions; this way you control the absolute number of elements in the resulting RDD. The size of the resulting RDD will then depend on the Spark level of parallelism (the number of partitions).
Example from spark-shell:
import org.apache.spark.rdd.RDD

val numberOfPartitions = 1000
val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)
// keep at most 10 elements from each partition
val millionRddTruncated: RDD[Int] = millionRdd.mapPartitions(_.take(10))
val billionRddTruncated: RDD[Int] = sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))
millionRdd.count
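With the setup above (and assuming parallelize spreads the range evenly, so every partition holds at least 10 elements), both truncated RDDs end up with numberOfPartitions * 10 = 10000 elements, regardless of the input size:

millionRddTruncated.count // 10000
billionRddTruncated.count // 10000
millionRdd.count          // still 1000000, since mapPartitions does not mutate the original RDD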