Spark - Scala: split an RDD into two random parts

How can I take a Spark RDD and split it randomly into two RDDs, so that each RDD gets some fraction of the data (say 97% and 3%)?

I thought about shuffling the list and then taking the first part: shuffledList.take((0.97 * rddList.count).toInt)

But how can I shuffle an RDD?

Or is there a better way to split the list?
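A minimal sketch of that shuffle-and-take idea, assuming a live SparkContext named sc (the data and variable names here are made up for illustration):

 import scala.util.Random

 // Hypothetical setup: sc is an existing SparkContext.
 val rddList = sc.parallelize(1 to 1000)

 // "Shuffle" the RDD by sorting on a random key. Note the key is
 // non-deterministic, so recomputation of the RDD can reorder it.
 val shuffled = rddList.sortBy(_ => Random.nextDouble())

 // take() collects to a local Array on the driver, so the 97% piece
 // stops being an RDD. That is one reason this approach is awkward.
 val firstPart = shuffled.take((0.97 * rddList.count()).toInt)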

scala apache-spark rdd
2 answers

I found a simple and fast way to split the RDD:

 val Array(f1, f2) = data.randomSplit(Array(0.97, 0.03))

It will split the data using the provided weights.
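A slightly fuller sketch with an explicit seed for reproducibility (sc and the sample data are hypothetical):

 // Hypothetical data; any RDD works the same way.
 val data = sc.parallelize(1 to 100000)

 // A fixed seed makes the split reproducible across runs.
 val Array(train, holdout) = data.randomSplit(Array(0.97, 0.03), seed = 42L)

 // The sizes are only approximately 97%/3%: each element is sampled
 // independently, so counts vary slightly around the expected values.
 println(s"train: ${train.count()}, holdout: ${holdout.count()}")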


You should use the randomSplit method:

 def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
 // Randomly splits this RDD with the provided weights.
 // weights: weights for the splits; normalized if they don't sum to 1.
 // Returns the split RDDs in an array.
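Because the weights are normalized, they need not sum to 1. For example (illustrative names, not from the original post):

 // Same proportions: Array(97.0, 3.0) is normalized to Array(0.97, 0.03).
 val splitsA = data.randomSplit(Array(97.0, 3.0))
 val splitsB = data.randomSplit(Array(0.97, 0.03))

 // More than two weights also work, e.g. a 60/20/20 three-way split.
 val Array(train, dev, test) = data.randomSplit(Array(0.6, 0.2, 0.2))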

Here is its implementation in Spark 1.0:

 def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
   val sum = weights.sum
   val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
   normalizedCumWeights.sliding(2).map { x =>
     new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](x(0), x(1)), seed)
   }.toArray
 }
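To see what those lines compute, here is the cumulative-weights step in isolation; this is plain Scala and needs no Spark:

 val weights = Array(0.97, 0.03)
 val sum = weights.sum

 // scanLeft builds cumulative bounds: Array(0.0, 0.97, 1.0)
 val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)

 // sliding(2) pairs consecutive bounds into ranges [0.0, 0.97) and
 // [0.97, 1.0). Each BernoulliSampler keeps an element when its random
 // draw falls in its range; because all samplers share one seed, every
 // element gets the same draw and lands in exactly one output RDD.
 normalizedCumWeights.sliding(2).foreach { x =>
   println(s"range: [${x(0)}, ${x(1)})")
 }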
