Performing operations only on a subset of an RDD

I would like to perform some transformations on only a subset of an RDD (to speed up experimentation in the REPL).

Is it possible?

RDD has a take(num: Int): Array[T] method. I think I need something similar, but returning an RDD[T].

+7
apache-spark
4 answers

You can use RDD.sample to get back an RDD rather than an Array. For example, to take a ~1% sample without replacement:

    val data = ...
    data.count
    // res1: Long = 18066983

    val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
    sample.count
    // res3: Long = 180190

The third parameter is the seed, which fortunately becomes optional in the next version of Spark.
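As a minimal sketch of what that looks like (assuming Spark 1.0 or later, where sample takes a Long seed with a default value; the fraction here is arbitrary and data is the RDD from the snippet above):

    // ~1% sample without replacement, with an explicit seed
    val seededSample = data.sample(false, 0.01, 42L)

    // same sampling, letting Spark pick a random seed (newer Spark versions)
    val defaultSeedSample = data.sample(false, 0.01)

    seededSample.count()
    defaultSeedSample.count()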

+16

RDDs are distributed collections that are only materialized by actions. It is not possible to truncate your RDD to a fixed size and get an RDD back (which is why RDD.take(n) returns an Array[T], just like collect).

If you want an RDD of roughly the same size regardless of the input size, you can truncate the elements within each of your partitions; this way you can better control the absolute number of elements in the resulting RDD. The size of the resulting RDD will depend on the Spark parallelism (i.e. the number of partitions).

Example from spark-shell:

    import org.apache.spark.rdd.RDD

    val numberOfPartitions = 1000

    val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)
    val millionRddTruncated: RDD[Int] = millionRdd.mapPartitions(_.take(10))

    val billionRddTruncated: RDD[Int] =
      sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))

    millionRdd.count          // 1000000
    millionRddTruncated.count // 10000 = 10 items * 1000 partitions
    billionRddTruncated.count // 10000 = 10 items * 1000 partitions

+1

It is apparently possible to create a subset of an RDD by first calling its take method and then passing the returned array to SparkContext's makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism), which returns a new RDD.
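To make that concrete, a minimal sketch in spark-shell terms (the subset size and slice count are arbitrary; note that take pulls the subset to the driver, so this is only reasonable for small subsets):

    // take a fixed-size prefix to the driver as a local Array,
    // then redistribute it as a new RDD
    val data = sc.parallelize(1 to 1000000)
    val subsetArray: Array[Int] = data.take(1000)
    val subsetRdd = sc.makeRDD(subsetArray, 4)

    subsetRdd.count() // 1000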

This approach seems like a hack to me. Is there a better way?

0

I always use SparkContext's parallelize function to distribute data from an Array[T], but it seems that makeRDD does the same thing. Either of them is a correct way to do it.
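A small sketch comparing the two in spark-shell (as far as I can tell, the Scala makeRDD(seq, numSlices) overload simply delegates to parallelize, so they behave the same here):

    val localData: Array[Int] = (1 to 100).toArray

    val viaParallelize = sc.parallelize(localData, 4)
    val viaMakeRDD     = sc.makeRDD(localData, 4)

    viaParallelize.count() // 100
    viaMakeRDD.count()     // 100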

0
