How do simple random sampling and the DataFrame sample function work in Apache Spark (Scala)?

Q1. I am trying to get a simple random sample out of a Spark DataFrame (13 rows) using the sample function with the parameters withReplacement: false, fraction: 0.6, but it gives me samples of different sizes each time I run it, although it works fine when I set the third parameter (seed). Why is that?
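For reference, a minimal sketch of the calls in question (df is a placeholder for my 13-row DataFrame):

    // Without a seed: a different sample size on each run
    df.sample(withReplacement = false, fraction = 0.6).count()  // e.g. 7
    df.sample(withReplacement = false, fraction = 0.6).count()  // e.g. 9

    // With a seed: the same rows, and hence the same count, every run
    df.sample(withReplacement = false, fraction = 0.6, seed = 42L).count()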

Q2. How is the sample obtained after generating random numbers?

Thanks in advance

scala dataframe apache-spark pyspark apache-spark-sql
2 answers

How is the sample obtained after generating random numbers?

Depending on the fraction you want to select, two different algorithms are used. For the details, you can check Justin Pihony's answer to "Spark: Is the sample method on Dataframes uniform sampling?".

it gives me samples of different sizes every time I run it, although it works fine when I set the third parameter (seed). Why is that?

If the fraction is higher than RandomSampler.defaultMaxGapSamplingFraction, the selection is performed with a simple filter:

    items.filter { _ => rng.nextDouble() <= fraction }

Otherwise, simplifying things somewhat, it uses gap sampling: it repeatedly calls the iterator's drop method with random integers and takes the next element, as sketched below.
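Here is a minimal, self-contained sketch of that idea in plain Scala (a simplification for illustration, not Spark's actual GapSamplingIterator; the gap to the next selected element is drawn from a geometric distribution):

    import scala.util.Random

    // Simplified gap sampling: instead of flipping a coin per element,
    // draw how many elements to skip before the next selected one.
    def gapSample[T](input: Iterator[T], fraction: Double, rng: Random): Iterator[T] =
      new Iterator[T] {
        private var data = input
        private val lnq = math.log1p(-fraction)        // log(1 - fraction)
        private def advance(): Unit = {
          val u = math.max(rng.nextDouble(), 1e-10)    // guard against log(0)
          data = data.drop((math.log(u) / lnq).toInt)  // geometrically distributed gap
        }
        advance()  // position at the first sampled element
        def hasNext: Boolean = data.hasNext
        def next(): T = { val r = data.next(); advance(); r }
      }

    // gapSample((1 to 100).iterator, 0.1, new Random(42)).toList

This is also why gap sampling is only used for small fractions: when most elements are skipped anyway, jumping over whole gaps is cheaper than generating one random number per element.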

Keeping this in mind, it should be obvious that the number of returned elements is random, with an expected value (assuming there is nothing wrong with GapSamplingIterator) equal to fraction * rdd.count. If you set the seed, you get the same sequence of random numbers and, as a result, the same elements are included in the sample.
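Both effects are easy to see with the filter variant alone, in plain Scala (a sketch; the printed numbers are illustrative, not fixed outputs):

    import scala.util.Random

    def bernoulliSample(n: Int, fraction: Double, rng: Random): Seq[Int] =
      (1 to n).filter(_ => rng.nextDouble() <= fraction)

    // Unseeded: the sample size fluctuates around fraction * n = 0.6 * 13 = 7.8
    val sizes = Seq.fill(1000)(bernoulliSample(13, 0.6, new Random()).size)
    println(sizes.sum / 1000.0)  // close to 7.8

    // Seeded: the same random sequence, hence exactly the same sample
    println(bernoulliSample(13, 0.6, new Random(1L)))
    println(bernoulliSample(13, 0.6, new Random(1L)))  // identical to the line above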


The RDD API includes takeSample, which returns a "sample of specified size in an array". It works by calling sample repeatedly until it gets a sample at least as large as the one requested, and then randomly takes exactly the specified number of elements from it. A comment in the code notes that this shouldn't need to repeat often, because the initial sampling is biased towards larger sample sizes.
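So if you need a sample of an exact size, you can drop down to the RDD (a sketch, assuming a DataFrame df as above):

    // takeSample collects the result to the driver as a local Array
    val exact = df.rdd.takeSample(withReplacement = false, num = 5, seed = 42L)
    println(exact.length)  // always 5, as long as df has at least 5 rows

Note that, unlike sample, takeSample returns a local array rather than a distributed dataset, so it is only suitable for sample sizes that fit in driver memory.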

