How to generate a list of ascending random integers

I have an external collection containing n elements, from which I want to randomly select k, writing the indices of the selected elements to a serialized data file. I want the indices to be output in strictly ascending order, with no duplicates. Both n and k can be quite large, and it is generally impractical to store arrays of that size in memory.

The first algorithm I came across was to select a random number r[0] from 1 to n-k, and then select successive random numbers r[i] from r[i-1]+1 to n-k+i; only two values of r need to be kept in memory at any time. However, a fairly simple analysis shows that the probability of choosing small numbers is inconsistent with what it would be if the whole set were sampled uniformly. For example, if n were a billion and k half a billion, the probability of selecting the first index with the approach just described is vanishingly small (1 in half a billion), when in fact, since half of the indices are being selected, the first one should be chosen in 50% of cases. Even if I instead generate k random numbers and use external sorting to order them, I would have to discard duplicates and try again. As k approaches n, the number of retries keeps growing, with no guarantee of termination.
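For reference, here is a minimal C++11 sketch of that sequential scheme (my own reconstruction, not code from the question; I take the upper bound of each draw as n-k+i+1 so that the final pick can reach n). It illustrates the bias: index 1 can only ever be produced by the very first draw.

    #include <iostream>
    #include <random>

    // Biased sequential scheme: each r[i] is drawn uniformly from the window
    // (prev, n - k + i + 1], which leaves room for the remaining picks but
    // heavily under-selects small indices.
    void naive_ascending(long long n, long long k, std::mt19937_64& gen)
    {
        long long prev = 0; // one below the smallest selectable index
        for (long long i = 0; i < k; ++i) {
            std::uniform_int_distribution<long long> dist(prev + 1, n - k + i + 1);
            prev = dist(gen);
            std::cout << prev << '\n';
        }
    }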

I would like to find an O(k) or O(k log k) algorithm for this, if at all possible. The implementation language I will use is C++11, but descriptions in pseudocode would also be useful.

+7
c++ sorting algorithm random
7 answers

You can solve this recursively in O(k log k) by splitting at the midpoint of your range, sampling from the hypergeometric distribution to choose how many of the k values lie below versus above the midpoint, and then recursing into each half:

    #include <iostream>
    #include <random>

    // One shared engine: the original snippet constructed a default-seeded
    // engine on every call, which would produce the same "random" count each time.
    std::default_random_engine generator{std::random_device{}()};

    // Samples a hypergeometric distribution: returns the number of "successes"
    // among n draws without replacement from a population of N containing K
    // possible successes. Similar to scipy.stats.hypergeom.rvs in Python.
    // In this case, "success" means the selected value lies below the midpoint.
    int sample_hypergeometric(int n, int K, int N)
    {
        std::uniform_real_distribution<double> distribution(0.0, 1.0);
        int successes = 0;
        for (int trial = 0; trial < n; trial++) {
            if ((int)(distribution(generator) * N) < K) {
                successes++;
                K--;
            }
            N--;
        }
        return successes;
    }

    void select_k_from_n(int start, int k, int n)
    {
        if (k == 0) return;
        if (k == 1) {
            // a single value, uniform over (start, start + n]
            std::uniform_int_distribution<int> dist(1, n);
            std::cout << start + dist(generator) << '\n';
            return;
        }
        // find how many of the k results fall below the midpoint:
        int k1 = sample_hypergeometric(k, n >> 1, n);
        select_k_from_n(start, k1, n >> 1);
        select_k_from_n(start + (n >> 1), k - k1, n - (n >> 1));
    }

Sampling from the binomial distribution can also be used to approximate the hypergeometric distribution, with p = (n >> 1) / n, rejecting samples where k1 > (n >> 1).
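A minimal sketch of that approximation (my own reading of the suggestion, using std::binomial_distribution; the second rejection test is an extra safety check I added for the upper half):

    #include <random>

    // Approximate sample_hypergeometric(k, n >> 1, n) by drawing from
    // Binomial(k, (n >> 1) / n) and rejecting infeasible splits.
    int sample_k1_binomial(int k, int n, std::mt19937& gen)
    {
        int half = n >> 1;
        std::binomial_distribution<int> binom(k, static_cast<double>(half) / n);
        int k1;
        do {
            k1 = binom(gen);
        } while (k1 > half || k - k1 > n - half); // reject: a half would overflow
        return k1;
    }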

+3

If k is in practice of the same order of magnitude as n, a simple O(n) algorithm may be sufficient:

    #include <cassert>
    #include <iostream>
    #include <random>

    void sample_ascending(int n, int k, std::mt19937& engine)
    {
        assert(k <= n);
        std::uniform_real_distribution<double> rnd; // defaults to [0, 1)
        for (int i = 0; i < n; i++) {
            // select i with probability (slots still needed) / (indices remaining)
            if (rnd(engine) * (n - i) < k) {
                std::cout << i << std::endl;
                k--;
            }
        }
    }

It generates all ascending sequences with equal probability.
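That claim is easy to sanity-check empirically. A small self-contained test (my own sketch, not part of the answer) tallies the C(4,2) = 6 possible ascending pairs for n = 4, k = 2:

    #include <iostream>
    #include <map>
    #include <random>
    #include <vector>

    int main()
    {
        std::mt19937 engine{std::random_device{}()};
        std::uniform_real_distribution<double> rnd;
        std::map<std::vector<int>, int> freq;
        for (int trial = 0; trial < 600000; ++trial) {
            int n = 4, k = 2;
            std::vector<int> seq;
            for (int i = 0; i < n; i++) {
                if (rnd(engine) * (n - i) < k) {
                    seq.push_back(i);
                    k--;
                }
            }
            ++freq[seq];
        }
        for (const auto& p : freq) // expect roughly 100000 for each of the 6 pairs
            std::cout << p.first[0] << ',' << p.first[1] << ": " << p.second << '\n';
    }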

+5

As mentioned in my comment, use a std::set<int> to store the randomly generated integers: the resulting container is inherently sorted and contains no duplicates. Example code snippet:

    #include <random>
    #include <set>

    int main(void)
    {
        std::set<int> random_set;
        std::random_device rd;
        std::mt19937 mt_eng(rd());
        // min and max of the random value range
        const int m = 0;   // min
        const int n = 100; // max
        std::uniform_int_distribution<> dist(m, n);
        // how many values to generate
        const int k = 50;
        for (int i = 0; i < k; ++i) {
            // only values not previously present are actually inserted
            if (!random_set.insert(dist(mt_eng)).second)
                --i; // duplicate: retry this slot
        }
    }
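A slightly simpler variant (my own sketch, not from the answer) loops on the set's size instead of patching the loop counter with --i:

    #include <random>
    #include <set>

    int main()
    {
        std::set<int> random_set;
        std::mt19937 mt_eng{std::random_device{}()};
        std::uniform_int_distribution<> dist(0, 100);
        const int k = 50;
        while (static_cast<int>(random_set.size()) < k)
            random_set.insert(dist(mt_eng)); // duplicates are simply ignored
    }
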
+2

Could you adjust each successive index selection to compensate for the probability skew you are describing?

IANAS (I am not a statistician), but I assume that if you choose a random number r between 0 and 1 (which you would scale to the full remaining index range after the adjustment), you could adjust it by computing r^x (keeping the range 0..1 but increasing the probability of smaller numbers), where x is chosen by solving the equation for the probability of the first record?
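One concrete way to pin down such an exponent (my own filling-in, not from the answer): under a continuous approximation, the smallest of k uniform draws on (0,1) has CDF 1 - (1 - t)^k, so it can be sampled in closed form and then scaled to the remaining index range. This matches the answer's r^x shape with the power 1/k applied to 1 - r:

    #include <cmath>
    #include <iostream>
    #include <random>

    // Inverse-CDF sampling: if U ~ Uniform(0,1), then 1 - (1 - U)^(1/k) is
    // distributed like the minimum of k independent uniform draws on (0,1).
    double sample_first_fraction(double k, std::mt19937& gen)
    {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        return 1.0 - std::pow(1.0 - u(gen), 1.0 / k);
    }

    int main()
    {
        std::mt19937 gen{std::random_device{}()};
        double n = 1e9, k = 5e8;
        // Scaled to the index range, P(first index == 0) comes out near
        // 1 - exp(-k/n), which approximates the exact k/n well when k << n.
        std::cout << static_cast<long long>(sample_first_fraction(k, gen) * n) << '\n';
    }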

0

Assuming that you cannot store k random numbers in memory, you will have to generate the numbers in strictly increasing order as you go. One way to do this is to generate a number from 0 to n/k. Call this number x. The next number you need to generate is between x+1 and x + (n-x)/(k-1). Continue this way until you have generated k numbers.

Basically, you divide the remaining range by the number of values still to be generated, and then pick a number from the first partition of that range.

Example. You want to generate 3 numbers from 0 to 99 inclusive. So you first generate a number from 0 to 33. Say you selected 10.

So now you need a number from 11 to 99. The remaining range contains 89 values, and you have two values left to pick, so 89/2 = 44. You need a number from 11 to 54. Say you choose 36.

Your remaining range is 37 to 99, and you have one number left to choose. So pick a number at random between 37 and 99.

This will not give you a properly uniform distribution, because once you select a number, it is impossible to get anything smaller in a subsequent selection. But that may be good enough for your purposes.

This pseudocode shows the main idea:

    pick_k_from_n(n, k)
    {
        num_left = k
        last_k = 0
        while num_left > 0
        {
            // divide the remaining range into num_left partitions
            range_size = (n - last_k) / num_left
            // pick a number in the first partition
            r = random(range_size) + last_k + 1
            output(r)
            last_k = r
            num_left = num_left - 1
        }
    }

Note that this takes O(k) time and requires O(1) extra space.
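A direct C++11 translation of the pseudocode (my own sketch; I read random(m) as uniform over 0..m-1, and assume n - last_k >= num_left so every partition is non-empty, as the pseudocode does):

    #include <iostream>
    #include <random>

    void pick_k_from_n(int n, int k, std::mt19937& gen)
    {
        int num_left = k;
        int last_k = 0;
        while (num_left > 0) {
            // divide the remaining range into num_left partitions
            int range_size = (n - last_k) / num_left;
            // pick a number in the first partition
            std::uniform_int_distribution<int> dist(0, range_size - 1);
            int r = dist(gen) + last_k + 1;
            std::cout << r << '\n';
            last_k = r;
            --num_left;
        }
    }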

0

You can do this in O(k) time using Floyd's algorithm (not Floyd-Warshall, which is a shortest-path algorithm). The only data structure you need is a one-bit-per-entry table that tells you whether a number has already been selected. Hash table lookups are O(1), so this is not a burden, and the table can be held in memory even for very large n (if n is truly huge, you would have to use a B-tree or a Bloom filter or something similar).

To select k items from n:

    for j = n-k+1 to n:
        select random x from 1 to j
        if x is already in hash:
            insert j into hash
        else:
            insert x into hash

That's it. At the end, your hash table will contain a uniformly selected sample of k elements from n. Read them out in order (you may have to choose a hash table variant that allows that).
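A C++11 sketch of the loop above (my own; std::set stands in for the ordered hash table, so the sample can be read back in ascending order):

    #include <iostream>
    #include <random>
    #include <set>

    std::set<int> floyd_sample(int n, int k, std::mt19937& gen)
    {
        std::set<int> chosen;
        for (int j = n - k + 1; j <= n; ++j) {
            std::uniform_int_distribution<int> dist(1, j);
            int x = dist(gen);
            if (!chosen.insert(x).second) // x was already selected
                chosen.insert(j);
        }
        return chosen; // iterates in ascending order
    }

    int main()
    {
        std::mt19937 gen{std::random_device{}()};
        for (int v : floyd_sample(100, 10, gen))
            std::cout << v << '\n';
    }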

0

Here is an O(k log k + √n)-time algorithm that uses O(√n) words of space. It generalizes to an O(k + n^(1/c))-time, O(n^(1/c))-space algorithm for any integer constant c.

For intuition, imagine a simple algorithm that uses (for example) Floyd's sampling algorithm to choose k of the n elements and then radix sorts them with radix √n. Instead of remembering what the actual samples are, we make a first pass in which we run a variant of Floyd's algorithm that remembers only the number of samples in each bucket. The second pass visits each bucket in order and randomly resamples the appropriate number of items from that bucket's range. There is a short conditional-probability argument showing that this gives a uniform distribution.

    # untested Python code for illustration
    # b is the number of buckets (e.g., b ~ sqrt(n))
    import random

    def first_pass(n, k, b):
        counts = [0] * b  # list of b zeros
        for j in range(n - k, n):
            t = random.randrange(j + 1)
            if t // b >= counts[t % b]:  # intuitively, "t is not in the set"
                counts[t % b] += 1
            else:
                counts[j % b] += 1
        return counts
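The answer stops at the first pass. A possible second pass, in C++11 (my own reconstruction, assuming bucket i holds the values congruent to i mod b, as in the snippet): resample counts[i] distinct members of each bucket with Floyd's method and merge. For brevity this sketch sorts the k results in memory; a version honoring the O(√n) space bound would stream the buckets through a b-way merge instead.

    #include <algorithm>
    #include <iostream>
    #include <random>
    #include <set>
    #include <vector>

    // Second pass: bucket i is {i, i + b, i + 2b, ...} below n; draw counts[i]
    // distinct positions within it via Floyd's sampling and map them back.
    void second_pass(const std::vector<int>& counts, int n, int b, std::mt19937& gen)
    {
        std::vector<int> result;
        for (int i = 0; i < b; ++i) {
            int bucket_size = (n - i + b - 1) / b; // members of bucket i below n
            std::set<int> pos;
            for (int j = bucket_size - counts[i]; j < bucket_size; ++j) {
                std::uniform_int_distribution<int> dist(0, j);
                int x = dist(gen);
                if (!pos.insert(x).second)
                    pos.insert(j);
            }
            for (int p : pos)
                result.push_back(i + p * b); // position -> actual value
        }
        std::sort(result.begin(), result.end()); // in-memory stand-in for a b-way merge
        for (int v : result)
            std::cout << v << '\n';
    }
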
0
