One possible approach is to add random keys using mapPartitions:
import os
import numpy as np

swap = lambda x: (x[1], x[0])  # flips a (value, id) pair into (id, value)

def add_random_key(it):
    # seed a separate NumPy generator per partition
    seed = int.from_bytes(os.urandom(4), "big")
    rs = np.random.RandomState(seed)
    # prefix each element with a random sort key
    return ((rs.rand(), swap(x)) for x in it)
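For the code below to run, rdd_with_keys has to be built from the input rdd. A minimal sketch, assuming each element is first tagged with a unique id via zipWithUniqueId (which yields (value, id) pairs that swap then flips):

rdd_with_keys = (rdd
    # attach a unique id so each element keeps a stable identity
    .zipWithUniqueId()
    # add the random sort keys without changing the partitioning
    .mapPartitions(add_random_key, preservesPartitioning=True))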
Then you can repartition, sort each partition, and extract the values:
n = rdd.getNumPartitions()

(rdd_with_keys
    # send each record to a partition chosen by its random key
    .partitionBy(n)
    # order every partition by the random key
    .mapPartitions(sorted, preservesPartitioning=True)
    # drop the random keys and keep the (id, value) pairs
    .values())
If sorting each partition is still too slow, it can be replaced with a Fisher-Yates shuffle.
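A minimal sketch of that replacement (the helper name fisher_yates_partition is mine, not from the answer): it materializes each partition and shuffles it in place, so the per-partition sort is no longer needed.

import random

def fisher_yates_partition(it):
    items = list(it)
    # classic Fisher-Yates: walk backwards, swapping each slot with a random earlier one
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)
        items[i], items[j] = items[j], items[i]
    return iter(items)

# drop-in replacement for the sort step above:
# rdd_with_keys.partitionBy(n).mapPartitions(fisher_yates_partition).values()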
If you just need random data, you can use mllib.RandomRDDs
from pyspark.mllib.random import RandomRDDs

RandomRDDs.uniformRDD(sc, n)
In theory this could also be achieved using the input rdd, but it would require matching the number of elements per partition.
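For illustration only, pairing random keys with the input rdd might look like the sketch below; RDD.zip requires the same number of partitions and the same element count in every corresponding partition, which is exactly the constraint mentioned above (the seed value is arbitrary):

num_parts = rdd.getNumPartitions()
keys = RandomRDDs.uniformRDD(sc, rdd.count(), num_parts, seed=17)

# zip fails unless every partition of `keys` matches the element
# count of the corresponding partition of `rdd`
shuffled = (keys.zip(rdd)
                .sortByKey()
                .values())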