Generating all the indices of the sequence is usually a bad idea, as it can take a lot of time, especially if the ratio of the numbers to be selected to MAX is small (the complexity becomes dominated by O(MAX)). This gets worse if the ratio of the numbers to be selected to MAX approaches one, since then removing the selected indices from the sequence of all of them also becomes expensive (we approach O(MAX^2/2)). But for small numbers this generally works well and is not particularly error-prone.
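As an illustration of this first approach, here is a minimal sketch (the helper name and the use of std::shuffle are mine, not from the original code): materialize all the indices, shuffle them, and keep a prefix.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Hypothetical helper sketching the "generate all indices" approach:
// O(MAX) time and memory regardless of how few items are selected.
std::vector<int> SelectByShuffle(size_t n_select, size_t max_idx)
{
    assert(n_select <= max_idx);

    std::vector<int> all(max_idx);
    for(size_t i = 0; i < max_idx; ++ i)
        all[i] = int(i); // generate every index 0 .. MAX - 1

    static std::mt19937 rng(42); // fixed seed for reproducibility
    std::shuffle(all.begin(), all.end(), rng);

    all.resize(n_select); // keep the first n_select unique indices
    return all;
}
```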
Filtering the generated indices using a collection is also a bad idea, as some time is spent inserting the indices into the sequence, and progress is not guaranteed, because the same random number can be drawn several times (though for large enough MAX that is unlikely). This could be close to O(k n log^2(n)/2) complexity, ignoring the duplicates and assuming the collection uses a tree for efficient lookup (but with a significant constant cost k of allocating the tree nodes and possibly having to rebalance).
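A minimal sketch of this collection-based filtering (the helper name is mine): keep drawing random indices and let a std::set silently reject duplicates, so the loop only makes progress when a new number comes up.

```cpp
#include <cassert>
#include <random>
#include <set>
#include <vector>

// Hypothetical helper sketching the set-based approach: progress is
// only probabilistic, as duplicate draws are discarded.
std::vector<int> SelectBySet(size_t n_select, size_t max_idx)
{
    assert(n_select <= max_idx);

    static std::mt19937 rng(42); // fixed seed for reproducibility
    std::uniform_int_distribution<int> dist(0, int(max_idx) - 1);

    std::set<int> chosen;
    while(chosen.size() < n_select)
        chosen.insert(dist(rng)); // duplicates are silently dropped

    return std::vector<int>(chosen.begin(), chosen.end()); // sorted output
}
```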
Another option is to generate the random values uniquely from the start, guaranteeing progress. That means that in the first round, a random index in [0, MAX] is generated:
```
items: i0 i1 i2 i3 i4 i5 i6   (7 items total)
             ^^                (index 2)
```
In the second round, only [0, MAX - 1] is generated (since one item has already been selected):
```
items: i0 i1 i3 i4 i5 i6   (6 items total)
             ^^             (index 2 of these 6, but index 3 of the original 7)
```
Then the index values need to be adjusted: if the second index falls in the second half of the sequence (after the first index), it needs to be incremented to account for the gap. We can implement this as a loop, allowing us to select an arbitrary number of unique items.
For short sequences, this is a fairly fast O(n^2/2) algorithm:
```cpp
void RandomUniqueSequence(std::vector<int> &rand_num,
    const size_t n_select_num, const size_t n_item_num)
{
    assert(n_select_num <= n_item_num);

    rand_num.clear(); // make sure the output is empty

    // b1, b2: benchmark timings discussed in the text below

    for(size_t i = 0; i < n_select_num; ++ i) {
        int n = n_Rand(n_item_num - i - 1);
        // get a random number in [0, n_item_num - i - 1]
        // (one item less to choose from in each round)

        size_t n_where = i;
        for(size_t j = 0; j < i; ++ j) {
            if(n + j < rand_num[j]) {
                n_where = j;
                break;
            }
        }
        // find where it should be inserted, keeping rand_num sorted

        rand_num.insert(rand_num.begin() + n_where, 1, n + n_where);
        // insert it, adjusted by the number of smaller chosen indices
    }
}
```
Where n_select_num is your 5 and n_item_num is your MAX. n_Rand(x) returns random integers in [0, x] (inclusive). This can be made a little faster when selecting a lot of items (e.g. not 5 but 500) by using binary search to find the insertion point. To do that, we need to make sure we meet the requirements.
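The n_Rand() helper is not shown in this answer; a hypothetical stand-in matching the stated contract (uniform integers in [0, x], inclusive) might look like this:

```cpp
#include <random>

// Hypothetical stand-in for the n_Rand() helper used in this answer:
// returns a uniformly distributed integer in [0, x], inclusive.
int n_Rand(int x)
{
    static std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, x);
    return dist(rng);
}
```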
We will perform the binary search by comparing n + j < rand_num[j], which is the same as n < rand_num[j] - j. We need to show that rand_num[j] - j is still a sorted sequence when rand_num[j] is a sorted sequence. This is, fortunately, easily shown: the smallest distance between two consecutive elements of the original rand_num is one (the generated numbers are unique, so there is always a difference of at least 1), while the indices j differ by exactly 1. Thus, after subtracting the indices j from the elements rand_num[j], the consecutive differences are at least zero; in the "worst" case we get a constant sequence, but it never decreases. Therefore binary search can be used, yielding an O(n log(n)) algorithm:
```cpp
struct TNeedle { // wraps the random number for the comparator overloads
    int n;

    TNeedle(int _n)
        :n(_n)
    {}
};

class CCompareWithOffset { // compares rand_num[j] - j with the needle
protected:
    std::vector<int>::iterator m_p_offset_begin;

public:
    CCompareWithOffset(std::vector<int>::iterator p_offset_begin)
        :m_p_offset_begin(p_offset_begin)
    {}

    bool operator ()(const int &r_value, TNeedle n) const
    {
        size_t n_index = &r_value - &*m_p_offset_begin;
        // calculate the index j of this element in the array

        return r_value < n.n + n_index; // i.e. r_value - n_index < n.n
    }

    bool operator ()(TNeedle n, const int &r_value) const
    {
        size_t n_index = &r_value - &*m_p_offset_begin;
        // calculate the index j of this element in the array

        return n.n + n_index < r_value; // i.e. n.n < r_value - n_index
    }
};
```

And finally:

```cpp
void RandomUniqueSequence(std::vector<int> &rand_num,
    const size_t n_select_num, const size_t n_item_num)
{
    assert(n_select_num <= n_item_num);

    rand_num.clear(); // make sure the output is empty

    // b1, b2: benchmark timings discussed in the text below

    for(size_t i = 0; i < n_select_num; ++ i) {
        int n = n_Rand(n_item_num - i - 1);
        // get a random number in [0, n_item_num - i - 1]

        std::vector<int>::iterator p_where = std::upper_bound(rand_num.begin(),
            rand_num.end(), TNeedle(n), CCompareWithOffset(rand_num.begin()));
        // find the insertion point by binary search

        rand_num.insert(p_where, 1, n + (p_where - rand_num.begin()));
        // insert it, adjusted by the number of smaller chosen indices
    }
}
```
I tested this in three ways. First, 3 numbers were selected out of 7 items, and a histogram of the selected items was accumulated over 10,000 runs:
```
4265 4229 4351 4267 4267 4364 4257
```
This shows that each of the 7 items was selected approximately the same number of times, and there is no apparent bias caused by the algorithm. All the sequences were also checked for correctness (uniqueness of their contents).
The second test consisted of selecting 7 numbers out of 5000 items. The run time of several versions of the algorithm was accumulated over 10,000,000 runs. The results are denoted in the code comments as b1. The simple version of the algorithm is slightly faster.
The third test consisted of selecting 700 numbers out of 5000 items. The run time of several versions was again accumulated, this time over 10,000 runs. The results are denoted in the code comments as b2. The binary-search version of the algorithm is now more than twice as fast as the simple one.
The second method becomes faster when selecting more than about 75 items on my machine (note that the complexity of either algorithm does not depend on the total number of items, MAX).
It is worth noting that the above algorithms generate the random numbers in ascending order. But it would be simple to add another array in which the numbers would be saved in the order they were generated, and to return that instead (at a negligible additional O(n) cost). There is no need to shuffle the output: that would be much slower.
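A minimal sketch of this variation (the function name and the use of std::uniform_int_distribution in place of n_Rand are mine): the sorted array is still used for the insertion-point search, while a second array records the same values in generation order.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Hypothetical variant that returns the chosen indices in the order
// they were generated, instead of ascending order.
void RandomUniqueSequenceUnordered(std::vector<int> &generation_order,
    const size_t n_select_num, const size_t n_item_num)
{
    assert(n_select_num <= n_item_num);

    static std::mt19937 rng(42); // stand-in for n_Rand()

    std::vector<int> rand_num; // sorted working array, as before
    generation_order.clear();

    for(size_t i = 0; i < n_select_num; ++ i) {
        std::uniform_int_distribution<int> dist(0, int(n_item_num - i) - 1);
        int n = dist(rng);

        size_t n_where = i;
        for(size_t j = 0; j < i; ++ j) {
            if(n + int(j) < rand_num[j]) {
                n_where = j;
                break;
            }
        }
        // find the insertion point, exactly as in the simple version

        rand_num.insert(rand_num.begin() + n_where, 1, n + int(n_where));
        generation_order.push_back(n + int(n_where));
        // record the same value in generation order (the O(n) extra cost)
    }
}
```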
Note that the sources are in C++; I do not have Java on this machine, but the concept should be clear.
EDIT
For fun, I also implemented the approach that generates a list with all the indices 0 .. MAX, selects from it randomly, and removes the chosen items from the list to guarantee uniqueness. Since I chose a fairly high MAX (5000), the performance is catastrophic:
I also implemented the approach with a set (a C++ collection), which actually comes second in benchmark b2, being only about 50% slower than the binary-search approach. That is understandable, as set uses a binary tree, where the insertion cost is similar to that of binary search. The only difference is the chance of drawing duplicate items, which slows down progress.
Full source code here .