Modified Fisher-Yates Algorithm
The shuffle solution can be improved by noting that you only need to shuffle the first k elements of the array. This is still O(n), however, because a naive implementation of the shuffle requires an array of size n, which must be initialized to the n values 0 to n-1.
    Initialize value[n] to {0 .. n-1}
    For i from 0 to k-1:
        swap(value[i], value[random_in_range(i, n)])
    Result is value[0 .. k-1]
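For concreteness, here is a minimal runnable sketch of this partial shuffle in Python; the function name and the use of random.randrange are my own choices, not part of the original:

    import random

    def sample_partial_shuffle(n, k):
        # Select k distinct values from 0..n-1 by partially shuffling a full array.
        # Note the O(n) initialization, even though only the first k positions
        # are ever shuffled.
        value = list(range(n))
        for i in range(k):
            j = random.randrange(i, n)   # random index in [i, n)
            value[i], value[j] = value[j], value[i]
        return value[:k]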
To improve on this, we can use a kind of virtual array consisting of two parts:
value: an array of the first k elements, which will become the resulting selection. Initialized to {0 .. k-1}.
rest: a sparse data structure (such as a hash table) with a capacity of k entries, containing all the other entries of the array whose value is no longer simply their own index. Initially empty.
Now we can define a function that swaps elements i and j of the virtual array (note: i < k, as guaranteed by the algorithm above):
    # To swap elements i and j
    If j < k:
        # Both elements to be swapped are in the selection
        tmp = value[i]; value[i] = value[j]; value[j] = tmp
    Else if j in rest:
        # Element j has been swapped before
        tmp = value[i]; value[i] = rest[j]; rest[j] = tmp
    Else:
        # The value at j is still j, so we add it to the virtual array now
        rest[j] = value[i]; value[i] = j
This uses O(k) space and time for any value of k ≤ n.
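Putting the pieces together, a minimal Python sketch of this virtual-array variant might look like the following; the function name and the choice of a plain dict for rest are assumptions, not prescribed above:

    import random

    def sample_sparse_fisher_yates(n, k):
        # value[i] holds the i-th selected element; starts as {0..k-1}.
        value = list(range(k))
        # rest maps an index j >= k to its current value, but only for indices
        # whose value is no longer simply j.  It holds at most k entries.
        rest = {}
        for i in range(k):
            j = random.randrange(i, n)
            if j < k:
                # Both positions live in the selection array.
                value[i], value[j] = value[j], value[i]
            elif j in rest:
                # Position j has been touched before; swap with its stored value.
                value[i], rest[j] = rest[j], value[i]
            else:
                # The value at j is still j itself; record the swap lazily.
                rest[j] = value[i]
                value[i] = j
        return value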
Three-Algorithm Strategy
A simpler solution that uses O(k) memory is to keep a hash table of all values selected so far and generate values until the hash table contains k of them, rejecting duplicates.
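A minimal sketch of this rejection approach in Python, using a built-in set as the hash table (the function name is my own):

    import random

    def sample_with_rejection(n, k):
        # Keep drawing uniform values and discarding duplicates until we
        # have k distinct ones.  Assumes k <= n.  Efficient when k is small
        # relative to n, so duplicates are rare.
        seen = set()
        while len(seen) < k:
            seen.add(random.randrange(n))
        return list(seen)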
For small k, the probability that a randomly selected item is a duplicate is negligible, and the naive hash-table approach is by far the easiest solution. On the other hand, if k is a significant fraction of n, the hash table is largely wasted space, since a simple bit vector of size n is enough to record which values have already been seen. And for very large k, the rejection algorithm takes too long as the sample fills up, and the full n-element vector needed for shuffling is not much more space than the vector used to store the sample anyway.
As a result, a pragmatic solution should probably pick whichever of the three approaches uses the least space and time: for values of k large enough that an n-bit bit vector is smaller than a hash table of k elements, but not so large that the probability of rejection becomes significant (say n/64 ≤ k ≤ n/4), use the bit vector. For smaller values of k, use the simple hash-table algorithm, and for values of k close to n, use the Fisher-Yates shuffle on a full n-element vector (but limited to k steps).
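As a rough sketch of that dispatch logic, reusing the sample_with_rejection and sample_partial_shuffle helpers sketched earlier and treating the n/64 and n/4 thresholds as tunable rather than definitive:

    import random

    def sample_with_bitvector(n, k):
        # Rejection sampling, but the "seen" record is an n-entry flag array
        # instead of a hash table.  A real implementation would pack bits
        # instead of using one byte per flag.
        seen = bytearray(n)
        result = []
        while len(result) < k:
            j = random.randrange(n)
            if not seen[j]:
                seen[j] = 1
                result.append(j)
        return result

    def sample(n, k):
        # Dispatch to whichever of the three strategies is cheapest for this k.
        if k < n // 64:
            return sample_with_rejection(n, k)    # hash set; duplicates are rare
        elif k <= n // 4:
            return sample_with_bitvector(n, k)    # n-bit "seen" vector
        else:
            return sample_partial_shuffle(n, k)   # full array, k shuffle steps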
Since we only choose an O(n) strategy in cases where k > c·n for some constant c, the composite algorithm is still O(k) in time and space, because under that restriction n is in O(k).