Random sampling from data quantiles while maintaining the original probability distribution

Following my previous question, entitled "Random sampling from a dataset, preserving the original probability distribution", I want to sample from a set of 2000 numbers collected from one dimension. I want to run several tests (taking at most 10 samples in each test) while preserving the probability distribution, as far as possible, both across the overall testing process and within each test. Now, instead of completely random sampling, I split the data into 5 quantile groups, and in each of 10 tests I select 2 data elements from each group, choosing uniformly at random from that group's array.

The problem with completely random sampling was that, because the data distribution has a long tail, I got almost the same values in each test. I want each test to contain some small values, some medium values and some large values. So I tried the approach described above.

Fig 1. Density plot of the ~2k data elements.

This is the R code for calculating quantiles:

q=quantile(data, probs = seq(0, 1, by= 0.1)) 

Then I split the data into 5 quantile groups (each stored as an array) and draw a sample from each group. For example, I do this in Java:

    // Requires: import java.util.Random;
    public int getRandomData(int quantile) {
        // Placeholder values; each row holds the data of one quantile group.
        int[][] data = {
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20},
            {21, 22, 23, 24, 25}
        };
        int length = data[quantile].length;
        Random r = new Random();
        int randomInt = r.nextInt(length); // uniform index in [0, length)
        return data[quantile][randomInt];
    }
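The whole procedure described in the question (sort the data, cut it into 5 equal-count groups, draw 2 elements from each group per test) might be sketched as follows. The class name `StratifiedSampler` and the methods `splitIntoGroups` and `sampleTest` are hypothetical names chosen for this illustration, and drawing without replacement inside a group is an assumption; the question does not say whether repeats are allowed.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of the original question.
final class StratifiedSampler {

    // Sort a copy of the data and cut it into `groups` contiguous,
    // (nearly) equal-count groups, from the smallest values to the largest.
    static double[][] splitIntoGroups(double[] data, int groups) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double[][] result = new double[groups][];
        for (int g = 0; g < groups; g++) {
            int from = (int) ((long) g * sorted.length / groups);
            int to = (int) ((long) (g + 1) * sorted.length / groups);
            result[g] = Arrays.copyOfRange(sorted, from, to);
        }
        return result;
    }

    // Draw `perGroup` distinct elements uniformly at random from every group.
    // Assumes every group holds at least `perGroup` elements.
    static double[] sampleTest(double[][] groups, int perGroup, Random rng) {
        double[] sample = new double[groups.length * perGroup];
        int k = 0;
        for (double[] group : groups) {
            double[] pool = group.clone();
            for (int i = 0; i < perGroup; i++) {
                // Partial Fisher-Yates shuffle: swap a random unused element into slot i.
                int j = i + rng.nextInt(pool.length - i);
                double tmp = pool[i];
                pool[i] = pool[j];
                pool[j] = tmp;
                sample[k++] = pool[i];
            }
        }
        return sample;
    }
}
```

Calling `splitIntoGroups(data, 5)` once and then `sampleTest(groups, 2, rng)` once per test would give 10 values per test. When the groups have equal counts and the same number of elements is drawn from each, every data element has the same chance of being selected, so the sample mean should match the population mean in expectation; an individual 10-element sample can still differ noticeably, especially for long-tailed data.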

So, do the samples in each test, and across all tests, retain the characteristics of the original distribution, for example the mean and variance? If not, how should I organize the sampling to achieve this?

1 answer

retain the characteristics of the original distribution, for example the mean and variance?

The result will have a similar distribution. You may need an additional check to make sure that it meets your requirements, and perhaps redraw, but this approach will get you close.
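One way such a check and retry could look is sketched below. The class `SampleCheck`, its method names, and the relative tolerance `relTol` are all assumptions made for this illustration; the answer does not specify which statistics to compare or how close is close enough.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the original answer.
final class SampleCheck {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Variance with the n - 1 denominator; needs at least 2 values.
    static double variance(double[] x) {
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) ss += (v - m) * (v - m);
        return ss / (x.length - 1);
    }

    // True if the sample's mean and variance are each within a relative
    // tolerance `relTol` of the corresponding value for the full data.
    static boolean isClose(double[] sample, double[] population, double relTol) {
        double popMean = mean(population);
        double popVar = variance(population);
        return Math.abs(mean(sample) - popMean) <= relTol * Math.abs(popMean)
            && Math.abs(variance(sample) - popVar) <= relTol * popVar;
    }

    // Redraw up to `maxTries` times; return the first acceptable sample,
    // or null if none was found.
    static double[] drawUntilClose(Supplier<double[]> draw, double[] population,
                                   double relTol, int maxTries) {
        for (int t = 0; t < maxTries; t++) {
            double[] s = draw.get();
            if (isClose(s, population, relTol)) return s;
        }
        return null;
    }
}
```

Here `draw` stands for whatever sampling routine is in use. Note that rejecting and redrawing samples means the accepted samples are no longer purely random draws, and with only 10 values per test from long-tailed data a tight tolerance on the variance may rarely be met.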

If not, how to organize sampling to achieve this goal?

Unless you duplicate all of the data (for example, by taking every value twice), you need to include each value exactly once. That is the only way to get exactly the same distribution.
