Pyspark: Shuffle RDD

Question

I am trying to randomize the order of elements in an RDD. My current approach is to zip the elements with an RDD of shuffled integers, and then later join by those integers.

However, pyspark crashes with only 100,000,000 integers. I am using the code below.

My question is: is there a better way to either zip with a random index or otherwise shuffle?

I tried sorting by a random key, which works but is slow.

    def random_indices(n):
        """
        return an iterable of random indices in range(0,n)
        """
        indices = range(n)
        random.shuffle(indices)
        return indices

In pyspark, the following happens:

    Using Python version 2.7.3 (default, Jun 22 2015 19:33:41)
    SparkContext available as sc.
    >>> import clean
    >>> clean.sc = sc
    >>> clean.random_indices(100000000)
    Killed

python hadoop bigdata apache-spark pyspark

Marcin Aug 19 '15 at 10:41

1 answer

zero323 · Accepted Answer · 2015-08-19T23:25:40+0000

One possible approach is to add random keys using mapPartitions:

    import os
    import numpy as np

    swap = lambda x: (x[1], x[0])

    def add_random_key(it):
        # make sure we get a proper random seed
        seed = int(os.urandom(4).encode('hex'), 16)
        # create separate generator
        rs = np.random.RandomState(seed)
        # Could be randint if you prefer integers
        return ((rs.rand(), swap(x)) for x in it)

    rdd_with_keys = (rdd
        # It will be used as final key. If you don't accept gaps
        # use zipWithIndex but this should be cheaper
        .zipWithUniqueId()
        .mapPartitions(add_random_key, preservesPartitioning=True))

Then you can repartition, sort each partition and extract the values:

    n = rdd.getNumPartitions()

    (rdd_with_keys
        # partition by random key to put data on random partition
        .partitionBy(n)
        # Sort partition by random value to ensure random order on partition
        .mapPartitions(sorted, preservesPartitioning=True)
        # Extract (unique_id, value) pairs
        .values())

If sorting each partition is still too slow, you can replace it with a Fisher-Yates shuffle.
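A minimal sketch of what such a per-partition Fisher-Yates shuffle might look like. The function name `fisher_yates_partition` and its optional `rng` parameter are illustrative assumptions, not part of the code above. It materializes the whole partition in memory, so it assumes each partition fits on a single worker.

```python
import random


def fisher_yates_partition(it, rng=None):
    """Shuffle the elements of one partition with Fisher-Yates.

    `it` is the iterator that mapPartitions passes in; `rng` is a
    hypothetical hook for injecting a seeded generator (e.g. in tests).
    """
    if rng is None:
        # SystemRandom reads from os.urandom, so partitions do not
        # accidentally share a seed.
        rng = random.SystemRandom()
    items = list(it)  # the partition must fit in memory
    # Walk from the last slot down to the second one, swapping each slot
    # with a uniformly chosen slot at or before it.
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        items[i], items[j] = items[j], items[i]
    return iter(items)
```

Under these assumptions it would be passed to `mapPartitions` in place of `sorted`, for example `.mapPartitions(fisher_yates_partition, preservesPartitioning=True)`. The shuffle runs in linear time per partition instead of the O(n log n) of a sort.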

If you just need random data, you can use mllib.RandomRDDs

    from pyspark.mllib.random import RandomRDDs

    RandomRDDs.uniformRDD(sc, n)

Theoretically, it could be zipped with the input RDD, but that would require matching the number of elements per partition.
