The problem is unlikely to be in the code you show:
Let n be the size of the distribution, m the size of the requested sample set, and I the total number of invocations of getNextSample. We have I = C_0 + C_1 + ... + C_(m-1), where C_i is the number of calls to getNextSample made while the set has size i. To find E[C_i], note that each such call adds a new element with probability p = 1 - i/n, so C_i is geometrically distributed with success probability p. Therefore E[C_i] = 1/p = 1/(1 - i/n) <= 1/(1 - m/n), and hence E[I] <= m / (1 - m/n).
That is, a sample set of size m = n/2 will require on average fewer than 2m = n calls to getNextSample. If the method is unusably slow for you, it is most likely because getNextSample itself is slow. That is not surprising given the unsuitable representation of the distribution this method works on (it has to iterate over the entire distribution to find a random element, making each call O(n)).
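To make the bound concrete, here is a minimal simulation sketch (my own illustration; the class name is made up, and it assumes uniform weights, for which the success probability 1 - i/n is exact) that counts the draws needed to collect m = n/2 distinct elements:

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class CallCountSimulation {
    public static void main(String[] args) {
        int n = 1_000_000;           // size of the distribution
        int m = n / 2;               // size of the requested sample set
        Random rnd = new Random();
        Set<Integer> set = new HashSet<>();
        long calls = 0;              // plays the role of I above
        while (set.size() < m) {
            calls++;                 // one call to "getNextSample"
            set.add(rnd.nextInt(n)); // duplicates are simply rejected by the set
        }
        // The bound predicts E[calls] <= m / (1 - m/n) = 1_000_000 here
        // (the exact expectation for the uniform case is n * ln 2, about 693_000).
        System.out.println("calls = " + calls);
    }
}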
The following should be faster (if m < 0.8 n):
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Distribution<T> {
    private double[] cumulativeWeight;
    private T[] item;
    private double totalWeight;

    @SuppressWarnings("unchecked")
    Distribution(Map<T, Double> probabilityMap) {
        int i = 0;
        cumulativeWeight = new double[probabilityMap.size()];
        item = (T[]) new Object[probabilityMap.size()];
        for (Map.Entry<T, Double> entry : probabilityMap.entrySet()) {
            item[i] = entry.getKey();
            totalWeight += entry.getValue();
            cumulativeWeight[i] = totalWeight;
            i++;
        }
    }

    T randomItem() {
        // Pick a point in [0, totalWeight) and locate the first cumulative
        // weight above it via binary search.
        double weight = Math.random() * totalWeight;
        int index = Arrays.binarySearch(cumulativeWeight, weight);
        if (index < 0) {
            index = -index - 1; // not found: binarySearch returned -(insertionPoint) - 1
        }
        return item[index];
    }

    Set<T> randomSubset(int size) {
        Set<T> set = new HashSet<>();
        while (set.size() < size) {
            set.add(randomItem()); // duplicates are simply rejected by the set
        }
        return set;
    }
}

public class Test {
    public static void main(String[] args) {
        int max = 1_000_000;
        HashMap<Integer, Double> probabilities = new HashMap<>();
        for (int i = 0; i < max; i++) {
            probabilities.put(i, (double) i);
        }
        Distribution<Integer> d = new Distribution<>(probabilities);
        Set<Integer> set = d.randomSubset(max / 2);
        //System.out.println(set);
    }
}
The expected running time is O(m / (1 - m/n) * log n): the bound above limits the expected number of draws, and each draw costs O(log n) for the binary search. On my machine, a subset of size 500_000 out of a distribution of size 1_000_000 is computed in about 3 seconds.
As you can see, the expected running time tends to infinity as m approaches n. If that is a problem (i.e. m > 0.9 n), the following more sophisticated approach should work better:
Set<T> randomSubset(int size) {
    Set<T> set = new HashSet<>();
    while (set.size() < size) {
        T randomItem = randomItem();
        remove(randomItem); // removes the item from the distribution
        set.add(randomItem);
    }
    return set;
}
Efficient removal requires a different representation of the distribution, for example a binary tree in which each node stores the total weight of the subtree it is the root of.
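To illustrate, here is a minimal sketch of one possible such structure (the names TreeDistribution, randomIndex, etc. are my own, the array-backed complete binary tree is just one convenient layout, and floating-point drift from repeated subtraction is ignored). Sampling and removal each walk a single root-to-leaf path, so both are O(log n):

import java.util.HashSet;
import java.util.Set;

class TreeDistribution<T> {
    private final double[] tree; // tree[1] is the root; each node holds its subtree's total weight
    private final T[] item;      // item[j] corresponds to leaf tree[offset + j]
    private final int offset;    // index of the first leaf

    @SuppressWarnings("unchecked")
    TreeDistribution(T[] items, double[] weights) {
        int size = 1;
        while (size < items.length) {
            size *= 2; // round the leaf count up to a power of two
        }
        offset = size;
        tree = new double[2 * size];
        item = (T[]) new Object[size];
        System.arraycopy(items, 0, item, 0, items.length);
        System.arraycopy(weights, 0, tree, offset, weights.length);
        for (int node = offset - 1; node >= 1; node--) {
            tree[node] = tree[2 * node] + tree[2 * node + 1]; // inner node = sum of children
        }
    }

    int randomIndex() {
        double r = Math.random() * tree[1]; // tree[1] is the current total weight
        int node = 1;
        while (node < offset) { // descend towards the leaf containing r
            if (r < tree[2 * node]) {
                node = 2 * node;            // r falls in the left subtree
            } else {
                r -= tree[2 * node];
                node = 2 * node + 1;        // r falls in the right subtree
            }
        }
        return node - offset;
    }

    void remove(int index) {
        double w = tree[offset + index];
        for (int node = offset + index; node >= 1; node /= 2) {
            tree[node] -= w; // update the weights on the path up to the root
        }
    }

    Set<T> randomSubset(int size) {
        Set<T> set = new HashSet<>();
        while (set.size() < size) {
            int i = randomIndex();
            remove(i); // zero the weight so that i cannot be drawn again
            set.add(item[i]);
        }
        return set;
    }
}

With this representation, randomSubset runs in O(m log n) after an O(n) build, regardless of how close m is to n.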
But that is fairly complicated to implement, so I would not go down this route if m is known to be much smaller than n.