Spark - Scala: split an RDD into two random parts

How can I take a Spark RDD and split it randomly into two RDDs, so that each RDD gets some fraction of the data (say 97% and 3%)?

I thought about shuffling the list and then taking the first part: shuffledList.take((0.97 * rddList.count).toInt)

But how can I shuffle an RDD?

Or is there a better way to split the list?
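A minimal sketch of that shuffle-and-take idea, assuming a live SparkContext named sc (the data and variable names here are made up for illustration):

 import scala.util.Random

 // Hypothetical setup: sc is an existing SparkContext.
 val rddList = sc.parallelize(1 to 1000)

 // "Shuffle" the RDD by sorting on a random key. Note the key is
 // non-deterministic, so recomputation of the RDD can reorder it.
 val shuffled = rddList.sortBy(_ => Random.nextDouble())

 // take() collects to a local Array on the driver, so the 97% piece
 // stops being an RDD. That is one reason this approach is awkward.
 val firstPart = shuffled.take((0.97 * rddList.count()).toInt)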

scala apache-spark rdd
2 answers

I found a simple and fast way to split the RDD:

 val Array(f1, f2) = data.randomSplit(Array(0.97, 0.03))

It will split the data using the provided weights.
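A slightly fuller sketch with an explicit seed for reproducibility (sc and the sample data are hypothetical):

 // Hypothetical data; any RDD works the same way.
 val data = sc.parallelize(1 to 100000)

 // A fixed seed makes the split reproducible across runs.
 val Array(train, holdout) = data.randomSplit(Array(0.97, 0.03), seed = 42L)

 // The sizes are only approximately 97%/3%: each element is sampled
 // independently, so counts vary slightly around the expected values.
 println(s"train: ${train.count()}, holdout: ${holdout.count()}")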


You should use the randomSplit method:

 def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
 // Randomly splits this RDD with the provided weights.
 // weights: weights for the splits; normalized if they don't sum to 1.
 // Returns the split RDDs in an array.
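Because the weights are normalized, they need not sum to 1. For example (illustrative names, not from the original post):

 // Same proportions: Array(97.0, 3.0) is normalized to Array(0.97, 0.03).
 val splitsA = data.randomSplit(Array(97.0, 3.0))
 val splitsB = data.randomSplit(Array(0.97, 0.03))

 // More than two weights also work, e.g. a 60/20/20 three-way split.
 val Array(train, dev, test) = data.randomSplit(Array(0.6, 0.2, 0.2))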

Here is its implementation in Spark 1.0:

 def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
   val sum = weights.sum
   val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
   normalizedCumWeights.sliding(2).map { x =>
     new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](x(0), x(1)), seed)
   }.toArray
 }
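To see what those lines compute, here is the cumulative-weights step in isolation; this is plain Scala and needs no Spark:

 val weights = Array(0.97, 0.03)
 val sum = weights.sum

 // scanLeft builds cumulative bounds: Array(0.0, 0.97, 1.0)
 val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)

 // sliding(2) pairs consecutive bounds into ranges [0.0, 0.97) and
 // [0.97, 1.0). Each BernoulliSampler keeps an element when its random
 // draw falls in its range; because all samplers share one seed, every
 // element gets the same draw and lands in exactly one output RDD.
 normalizedCumWeights.sliding(2).foreach { x =>
   println(s"range: [${x(0)}, ${x(1)})")
 }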
