How is the sample obtained after generating random numbers?
Depending on the fraction you want to select, there are two different algorithms. For details, see Justin Pihony's answer to "SPARK Is sample method on Dataframes uniform sampling?".
It gives me samples of different sizes every time I run it, although it works fine when I set the third parameter (seed). Why is that?
If the fraction is higher than RandomSampler.defaultMaxGapSamplingFraction, the selection is performed using a simple filter:
items.filter { _ => rng.nextDouble() <= fraction }
Otherwise, simplifying things a bit, it repeatedly calls the drop method with random integers and takes the next element.
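To make the two strategies concrete, here is a minimal sketch in plain Scala. It is not Spark's actual code: the object name GapSamplingSketch, the helper names, and the 1e-12 floor are my own assumptions for illustration. The idea is that, instead of one random draw per element, gap sampling draws the number of elements to skip from a geometric distribution, which pays off when the fraction is small.

```scala
import scala.util.Random

// Hypothetical names throughout; a simplified illustration, not Spark's implementation.
object GapSamplingSketch {
  // Filter-based sampling: one random draw per element.
  def filterSample[T](items: Iterator[T], fraction: Double, rng: Random): Iterator[T] =
    items.filter(_ => rng.nextDouble() <= fraction)

  // Gap sampling: draw how many elements to skip, drop them, take the next one.
  def gapSample[T](items: Iterator[T], fraction: Double, rng: Random): Iterator[T] = {
    require(fraction > 0.0 && fraction < 1.0, "fraction must be in (0, 1)")
    new Iterator[T] {
      private val lnq = math.log1p(-fraction) // ln(1 - fraction), always negative

      // Skip a geometrically distributed number of elements:
      // P(gap >= k) = (1 - fraction)^k, so each element is kept with probability `fraction`.
      private def skip(): Unit = {
        val u = math.max(rng.nextDouble(), 1e-12) // assumed floor, avoids log(0)
        val gap = (math.log(u) / lnq).toInt
        var i = 0
        while (i < gap && items.hasNext) { items.next(); i += 1 }
      }

      skip() // position on the first sampled element

      def hasNext: Boolean = items.hasNext
      def next(): T = { val x = items.next(); skip(); x }
    }
  }
}
```

Calling GapSamplingSketch.gapSample((0 until 100000).iterator, 0.1, new Random(7L)) should return roughly 10000 elements in their original order, with the exact count depending on the random draws.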
With this in mind, it should be obvious that the number of returned elements is random, with a mean equal to fraction * rdd.count (assuming there is nothing wrong with GapSamplingIterator). If you set the seed, you get the same sequence of random numbers and, as a result, the same elements are included in the sample.
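The effect of the seed can be reproduced without Spark. The sketch below uses scala.util.Random as a stand-in for Spark's internal generator; SeedDemo and sample are hypothetical names, and the numbers (1000 elements, fraction 0.1, seed 42) are arbitrary choices for the example.

```scala
import scala.util.Random

// Hypothetical stand-in for the sampling step; not Spark code.
object SeedDemo {
  // Keep each of the numbers 0 until n independently with probability `fraction`.
  def sample(n: Int, fraction: Double, rng: Random): Vector[Int] =
    (0 until n).filter(_ => rng.nextDouble() <= fraction).toVector

  def main(args: Array[String]): Unit = {
    val n = 1000
    val fraction = 0.1

    // No explicit seed: each run draws different random numbers,
    // so the sizes scatter around fraction * n = 100.
    val sizes = Seq.fill(5)(sample(n, fraction, new Random()).size)
    println(s"unseeded sizes: $sizes")

    // Same seed: same sequence of random numbers, hence the same elements.
    val a = sample(n, fraction, new Random(42L))
    val b = sample(n, fraction, new Random(42L))
    println(s"seeded samples identical: ${a == b}")
  }
}
```

Since every element is kept independently, the sample size follows a binomial distribution, so for these numbers a spread of around ten elements either side of 100 is to be expected.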
zero323