Performing operations only on a subset of an RDD

I would like to perform some transformations on only a subset of an RDD (to speed up experimentation in the REPL).

Is it possible?

RDD has a take(num: Int): Array[T] method. I think I need something similar, but returning an RDD[T].

+7
apache-spark
4 answers

You can use RDD.sample to get back an RDD rather than an Array. For example, to take a ~1% sample without replacement:

    val data = ...
    data.count
    // res1: Long = 18066983

    val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
    sample.count
    // res3: Long = 180190

The third parameter is the seed, which fortunately becomes optional in the next version of Spark.
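As a minimal sketch of what that looks like (assuming Spark 1.0 or later, where sample takes a Long seed with a default value; the fraction here is arbitrary and data is the RDD from the snippet above):

    // ~1% sample without replacement, with an explicit seed
    val seededSample = data.sample(false, 0.01, 42L)

    // same sampling, letting Spark pick a random seed (newer Spark versions)
    val defaultSeedSample = data.sample(false, 0.01)

    seededSample.count()
    defaultSeedSample.count()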

+16

RDDs are distributed collections that are only materialized by actions. It is not possible to truncate your RDD to a fixed size and get an RDD back (which is why RDD.take(n) returns an Array[T], just like collect).

If you want an RDD of roughly the same size regardless of the input size, you can truncate the elements within each of your partitions; this way you can better control the absolute number of elements in the resulting RDD. The size of the resulting RDD will depend on the Spark parallelism (i.e. the number of partitions).

Example from spark-shell:

    import org.apache.spark.rdd.RDD

    val numberOfPartitions = 1000

    val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)
    val millionRddTruncated: RDD[Int] = millionRdd.mapPartitions(_.take(10))

    val billionRddTruncated: RDD[Int] =
      sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))

    millionRdd.count          // 1000000
    millionRddTruncated.count // 10000 = 10 items * 1000 partitions
    billionRddTruncated.count // 10000 = 10 items * 1000 partitions

+1

It is apparently possible to create a subset of an RDD by first calling its take method and then passing the returned array to SparkContext's makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism), which returns a new RDD.
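To make that concrete, a minimal sketch in spark-shell terms (the subset size and slice count are arbitrary; note that take pulls the subset to the driver, so this is only reasonable for small subsets):

    // take a fixed-size prefix to the driver as a local Array,
    // then redistribute it as a new RDD
    val data = sc.parallelize(1 to 1000000)
    val subsetArray: Array[Int] = data.take(1000)
    val subsetRdd = sc.makeRDD(subsetArray, 4)

    subsetRdd.count() // 1000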

This approach seems like a hack to me. Is there a better way?

0

I always use SparkContext's parallelize function to distribute data from an Array[T], but it seems that makeRDD does the same thing. Either of them is a correct way to do it.
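A small sketch comparing the two in spark-shell (as far as I can tell, the Scala makeRDD(seq, numSlices) overload simply delegates to parallelize, so they behave the same here):

    val localData: Array[Int] = (1 to 100).toArray

    val viaParallelize = sc.parallelize(localData, 4)
    val viaMakeRDD     = sc.makeRDD(localData, 4)

    viaParallelize.count() // 100
    viaMakeRDD.count()     // 100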

0
