How do simple random sampling and the DataFrame sample function work in Apache Spark (Scala)?

Q1. I am trying to get a simple random sample out of a Spark DataFrame (13 rows) using the sample function with the parameters withReplacement: false, fraction: 0.6, but it gives me samples of different sizes each time I run it, although it works fine when I set the third parameter (seed). Why is that?
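For reference, a minimal sketch of the calls in question (df is a placeholder for my 13-row DataFrame):

    // Without a seed: a different sample size on each run
    df.sample(withReplacement = false, fraction = 0.6).count()  // e.g. 7
    df.sample(withReplacement = false, fraction = 0.6).count()  // e.g. 9

    // With a seed: the same rows, and hence the same count, every run
    df.sample(withReplacement = false, fraction = 0.6, seed = 42L).count()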

Q2. How is the sample obtained after generating random numbers?

Thanks in advance

scala dataframe apache-spark pyspark apache-spark-sql
2 answers

How is the sample obtained after generating random numbers?

Depending on the fraction you want to select, two different algorithms are used. For the details, you can check Justin Pihony's answer to "Spark: Is the sample method on Dataframes uniform sampling?".

it gives me samples of different sizes every time I run it, although it works fine when I set the third parameter (seed). Why is that?

If the fraction is higher than RandomSampler.defaultMaxGapSamplingFraction, the selection is performed with a simple filter:

    items.filter { _ => rng.nextDouble() <= fraction }

Otherwise, simplifying things somewhat, it uses gap sampling: it repeatedly calls the iterator's drop method with random integers and takes the next element, as sketched below.
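Here is a minimal, self-contained sketch of that idea in plain Scala (a simplification for illustration, not Spark's actual GapSamplingIterator; the gap to the next selected element is drawn from a geometric distribution):

    import scala.util.Random

    // Simplified gap sampling: instead of flipping a coin per element,
    // draw how many elements to skip before the next selected one.
    def gapSample[T](input: Iterator[T], fraction: Double, rng: Random): Iterator[T] =
      new Iterator[T] {
        private var data = input
        private val lnq = math.log1p(-fraction)        // log(1 - fraction)
        private def advance(): Unit = {
          val u = math.max(rng.nextDouble(), 1e-10)    // guard against log(0)
          data = data.drop((math.log(u) / lnq).toInt)  // geometrically distributed gap
        }
        advance()  // position at the first sampled element
        def hasNext: Boolean = data.hasNext
        def next(): T = { val r = data.next(); advance(); r }
      }

    // gapSample((1 to 100).iterator, 0.1, new Random(42)).toList

This is also why gap sampling is only used for small fractions: when most elements are skipped anyway, jumping over whole gaps is cheaper than generating one random number per element.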

Keeping this in mind, it should be obvious that the number of returned elements is random, with an expected value (assuming there is nothing wrong with GapSamplingIterator) equal to fraction * rdd.count. If you set the seed, you get the same sequence of random numbers and, as a result, the same elements are included in the sample.
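Both effects are easy to see with the filter variant alone, in plain Scala (a sketch; the printed numbers are illustrative, not fixed outputs):

    import scala.util.Random

    def bernoulliSample(n: Int, fraction: Double, rng: Random): Seq[Int] =
      (1 to n).filter(_ => rng.nextDouble() <= fraction)

    // Unseeded: the sample size fluctuates around fraction * n = 0.6 * 13 = 7.8
    val sizes = Seq.fill(1000)(bernoulliSample(13, 0.6, new Random()).size)
    println(sizes.sum / 1000.0)  // close to 7.8

    // Seeded: the same random sequence, hence exactly the same sample
    println(bernoulliSample(13, 0.6, new Random(1L)))
    println(bernoulliSample(13, 0.6, new Random(1L)))  // identical to the line above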


The RDD API includes takeSample, which returns a "sample of specified size in an array". It works by calling sample repeatedly until it gets a sample at least as large as the one requested, and then randomly takes exactly the specified number of elements from it. A comment in the code notes that this shouldn't need to repeat often, because the initial sampling is biased towards larger sample sizes.
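So if you need a sample of an exact size, you can drop down to the RDD (a sketch, assuming a DataFrame df as above):

    // takeSample collects the result to the driver as a local Array
    val exact = df.rdd.takeSample(withReplacement = false, num = 5, seed = 42L)
    println(exact.length)  // always 5, as long as df has at least 5 rows

Note that, unlike sample, takeSample returns a local array rather than a distributed dataset, so it is only suitable for sample sizes that fit in driver memory.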

