Modified Fisher-Yates Algorithm
The shuffle solution can be improved by noting that you only need to shuffle the first k elements of the array. This is still O(n), however, because a naive implementation of the shuffle requires an array of size n, which must be initialized to the n values 0 to n-1.
    Initialize value[n] to {0 .. n-1}
    For i from 0 to k-1:
        swap(value[i], value[random_in_range(i, n)])
    Result is value[0 .. k-1]
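For concreteness, here is a minimal runnable sketch of this partial shuffle in Python; the function name and the use of random.randrange are my own choices, not part of the original:

    import random

    def sample_partial_shuffle(n, k):
        # Select k distinct values from 0..n-1 by partially shuffling a full array.
        # Note the O(n) initialization, even though only the first k positions
        # are ever shuffled.
        value = list(range(n))
        for i in range(k):
            j = random.randrange(i, n)   # random index in [i, n)
            value[i], value[j] = value[j], value[i]
        return value[:k]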
To improve on this, we can use a kind of virtual array consisting of two parts:
value: an array of the first k elements, which will become the resulting selection. Initialized to {0 .. k-1}.
rest: a sparse data structure (such as a hash table) with a capacity of k entries, containing all the other entries of the array whose value is no longer simply their own index. Initially empty.
Now we can define a function that swaps elements i and j of the virtual array (note: i < k, as guaranteed by the algorithm above):
    # To swap elements i and j
    If j < k:
        # Both elements to be swapped are in the selection
        tmp = value[i]; value[i] = value[j]; value[j] = tmp
    Else if j in rest:
        # Element j has been swapped before
        tmp = value[i]; value[i] = rest[j]; rest[j] = tmp
    Else:
        # The value at j is still j, so we add it to the virtual array now
        rest[j] = value[i]; value[i] = j
This uses O(k) space and time for any value of k ≤ n.
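Putting the pieces together, a minimal Python sketch of this virtual-array variant might look like the following; the function name and the choice of a plain dict for rest are assumptions, not prescribed above:

    import random

    def sample_sparse_fisher_yates(n, k):
        # value[i] holds the i-th selected element; starts as {0..k-1}.
        value = list(range(k))
        # rest maps an index j >= k to its current value, but only for indices
        # whose value is no longer simply j.  It holds at most k entries.
        rest = {}
        for i in range(k):
            j = random.randrange(i, n)
            if j < k:
                # Both positions live in the selection array.
                value[i], value[j] = value[j], value[i]
            elif j in rest:
                # Position j has been touched before; swap with its stored value.
                value[i], rest[j] = rest[j], value[i]
            else:
                # The value at j is still j itself; record the swap lazily.
                rest[j] = value[i]
                value[i] = j
        return value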
Three-Algorithm Strategy
A simpler solution that uses O(k) memory is to keep a hash table of all values selected so far and generate values until the hash table contains k of them, rejecting duplicates.
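A minimal sketch of this rejection approach in Python, using a built-in set as the hash table (the function name is my own):

    import random

    def sample_with_rejection(n, k):
        # Keep drawing uniform values and discarding duplicates until we
        # have k distinct ones.  Assumes k <= n.  Efficient when k is small
        # relative to n, so duplicates are rare.
        seen = set()
        while len(seen) < k:
            seen.add(random.randrange(n))
        return list(seen)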
For small k, the probability that a randomly selected item is a duplicate is negligible, and the naive hash-table approach is by far the easiest solution. On the other hand, if k is a significant fraction of n, the hash table is largely wasted space, since a simple bit vector of size n is enough to record which values have already been seen. And for very large k, the rejection algorithm takes too long as the sample fills up, and the full n-element vector needed for shuffling is not much more space than the vector used to store the sample anyway.
As a result, a pragmatic solution should probably pick whichever of the three approaches uses the least space and time: for values of k large enough that an n-bit bit vector is smaller than a hash table of k elements, but not so large that the probability of rejection becomes significant (say n/64 ≤ k ≤ n/4), use the bit vector. For smaller values of k, use the simple hash-table algorithm, and for values of k close to n, use the Fisher-Yates shuffle on a full n-element vector (but limited to k steps).
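As a rough sketch of that dispatch logic, reusing the sample_with_rejection and sample_partial_shuffle helpers sketched earlier and treating the n/64 and n/4 thresholds as tunable rather than definitive:

    import random

    def sample_with_bitvector(n, k):
        # Rejection sampling, but the "seen" record is an n-entry flag array
        # instead of a hash table.  A real implementation would pack bits
        # instead of using one byte per flag.
        seen = bytearray(n)
        result = []
        while len(result) < k:
            j = random.randrange(n)
            if not seen[j]:
                seen[j] = 1
                result.append(j)
        return result

    def sample(n, k):
        # Dispatch to whichever of the three strategies is cheapest for this k.
        if k < n // 64:
            return sample_with_rejection(n, k)    # hash set; duplicates are rare
        elif k <= n // 4:
            return sample_with_bitvector(n, k)    # n-bit "seen" vector
        else:
            return sample_partial_shuffle(n, k)   # full array, k shuffle steps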
Since we only choose an O(n) strategy in cases where k > c·n for some constant c, the composite algorithm is still O(k) in time and space, because under that restriction n is in O(k).