How to efficiently generate a set of unique random numbers with a predefined distribution?

I have a map of elements with some probability distribution:

Map<SingleObjectiveItem, Double> itemsDistribution; 

For a given m, I need to generate a Set of m elements sampled from the distribution above.

At the moment I use a naive way to do this:

    while (mySet.size() < m) mySet.add(getNextSample(itemsDistribution));

The getNextSample(...) method draws an object from the distribution according to its probability. Now, as m increases, performance suffers badly. For m = 500 elements and itemsDistribution.size() = 1000 there are too many collisions, and the function stays in the while loop far too long. Create 1000 of these sets and you have an application that crawls.

Is there a more efficient way to create a unique set of random numbers with a "predefined" distribution? Most collection-shuffling techniques and the like only work for uniformly random selection. What would be a good way to solve this problem?

UPDATE: The loop will call getNextSample(...) "at least" 1 + 2 + 3 + ... + m = m(m+1)/2 times. That is, on the first pass we are sure to draw a sample for the set; the second iteration may call it at least twice, and so on. If getNextSample is sequential in nature, i.e. it traverses the entire cumulative distribution to find a sample, then the time complexity of the loop is at least n*m(m+1)/2, where 'n' is the number of elements in the distribution. If m = cn for 0 < c <= 1, the loop is at least Ω(n^3). And that is only the lower bound!

If we replace the sequential search with a binary search, the complexity is at least Ω(log n * n^2). Better, but perhaps still not good enough.

Additionally, deleting from the distribution is not an option, since I call the above loop k times to generate k such sets. These sets are part of a randomized "schedule" of items. Hence a "set" of items.

+3
java performance algorithm random
8 answers

The problem is unlikely to be the loop you show:

Let n be the size of the distribution and I the number of invocations of getNextSample. We have I = sum_i(C_i), where C_i is the number of calls to getNextSample while the set has size i. To find E[C_i], observe that C_i is the inter-arrival time of a Poisson process with λ = 1 - i/n, and is therefore exponentially distributed with parameter λ. Hence E[C_i] = 1/λ = 1/(1 - i/n) <= 1/(1 - m/n), and therefore E[I] <= m/(1 - m/n).

That is, sampling a set of size m = n/2 will take, on average, fewer than 2m = n calls to getNextSample (for the question's numbers, n = 1000 and m = 500, that is about 1000 expected calls, nowhere near the feared Ω(n^3)). If it is "slow" and "crawls", this is most likely because getNextSample itself is slow. That is actually unsurprising, given the unsuitable way the distribution is passed to the method (the method must iterate over the entire distribution to find a random element).

The following should be faster (if m < 0.8 n):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class Distribution<T> {
        private double[] cumulativeWeight;
        private T[] item;
        private double totalWeight;

        Distribution(Map<T, Double> probabilityMap) {
            int i = 0;
            cumulativeWeight = new double[probabilityMap.size()];
            item = (T[]) new Object[probabilityMap.size()];
            for (Map.Entry<T, Double> entry : probabilityMap.entrySet()) {
                item[i] = entry.getKey();
                totalWeight += entry.getValue();
                cumulativeWeight[i] = totalWeight; // running sum of the weights
                i++;
            }
        }

        T randomItem() {
            double weight = Math.random() * totalWeight;
            int index = Arrays.binarySearch(cumulativeWeight, weight);
            if (index < 0) {
                index = -index - 1; // first cumulative weight greater than 'weight'
            }
            return item[index];
        }

        Set<T> randomSubset(int size) {
            Set<T> set = new HashSet<>();
            while (set.size() < size) {
                set.add(randomItem()); // duplicates are simply rejected by the set
            }
            return set;
        }
    }

    public class Test {
        public static void main(String[] args) {
            int max = 1_000_000;
            HashMap<Integer, Double> probabilities = new HashMap<>();
            for (int i = 0; i < max; i++) {
                probabilities.put(i, (double) i);
            }
            Distribution<Integer> d = new Distribution<>(probabilities);
            Set<Integer> set = d.randomSubset(max / 2);
            //System.out.println(set);
        }
    }

The expected execution time is O(m / (1 - m/n) * log n). On my computer, a subset of size 500_000 of a set of 1_000_000 elements is computed in about 3 seconds.

As we can see, the expected execution time approaches infinity as m approaches n. If that is a problem (i.e. m > 0.9 n), the following more sophisticated approach should work better:

    Set<T> randomSubset(int size) {
        Set<T> set = new HashSet<>();
        while (set.size() < size) {
            T randomItem = randomItem();
            remove(randomItem); // removes the item from the distribution
            set.add(randomItem);
        }
        return set;
    }

Efficient removal requires a different representation of the distribution, for instance a binary tree where each node stores the total weight of the subtree it roots.

But that is rather involved, so I wouldn't go down this route if m is known to be significantly smaller than n.
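For readers who do want that route, here is a minimal sketch of such a weight-indexed structure (my own illustration, not code from this answer). It uses a Fenwick (binary indexed) tree instead of an explicit binary tree, which gives the same O(log n) weighted sampling and removal:

    import java.util.Random;

    class RemovableDistribution {
        private final double[] tree; // Fenwick tree over item weights, 1-based
        private final int n;
        private double totalWeight;
        private final Random rnd = new Random();

        RemovableDistribution(double[] weights) {
            n = weights.length;
            tree = new double[n + 1];
            for (int i = 0; i < n; i++) add(i, weights[i]);
        }

        private void add(int index, double delta) { // point update, O(log n)
            totalWeight += delta;
            for (int i = index + 1; i <= n; i += i & -i) tree[i] += delta;
        }

        private double prefix(int index) { // sum of weights[0 .. index-1]
            double s = 0;
            for (int i = index; i > 0; i -= i & -i) s += tree[i];
            return s;
        }

        // Samples an index proportionally to its remaining weight, then removes it.
        int sampleAndRemove() {
            double target = rnd.nextDouble() * totalWeight;
            int pos = 0;
            // Descend the implicit tree: find the largest pos with prefix(pos) <= target.
            for (int step = Integer.highestOneBit(n); step > 0; step >>= 1) {
                if (pos + step <= n && tree[pos + step] <= target) {
                    target -= tree[pos + step];
                    pos += step;
                }
            }
            add(pos, -(prefix(pos + 1) - prefix(pos))); // zero out the item's weight
            return pos; // 0-based index of the sampled item
        }
    }

Sampling a set of size m is then m calls to sampleAndRemove(); since the question needs k such sets, the weights would have to be restored (or the structure rebuilt) before generating the next set.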

+1

Start by generating a number of uniformly distributed random points in two dimensions.


Then apply your distribution as a curve over the points.


Now take all the points that fall inside the distribution and read off their x coordinates: these are your random numbers with the requested distribution.

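In code, this amounts to rejection sampling. A minimal sketch (my own, assuming discrete items indexed 0..n-1 with known weights and a known maximum weight):

    import java.util.Random;

    // Draw a uniform point (x, y); keep x only if y falls under the curve at x.
    static int sampleByRejection(double[] weight, double maxWeight, Random rnd) {
        while (true) {
            int x = rnd.nextInt(weight.length);      // uniform x coordinate
            double y = rnd.nextDouble() * maxWeight; // uniform y coordinate
            if (y < weight[x]) {
                return x; // the point lies under the curve: accept
            }
            // otherwise the point is rejected and we retry
        }
    }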

+3

You could implement your own random number generator (using a Monte Carlo method or any good uniform generator such as the Mersenne Twister) based on the inversion method (described here).

For example, for the exponential law: generate a uniform random number u in [0,1]; your exponential-law random value is then ln(1-u)/(-lambda), lambda being the parameter of the exponential law and ln the natural logarithm.
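A minimal sketch of that example (the value of lambda is mine, chosen for illustration):

    double lambda = 2.0;                    // rate parameter of the exponential law
    double u = Math.random();               // uniform random number in [0, 1)
    double x = Math.log(1 - u) / (-lambda); // exponentially distributed with rate lambda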

Hope this helps;).

0

If you are not too concerned about the randomness properties, I would do it like this:

  • create a buffer for pseudo random numbers

    double buff[MAX]; // [edit1] double pseudo-random numbers

    • MAX size should be large enough ... 1024 * 128 for example
    • the element type can be anything (float, int, DWORD, ...)
  • fill buffer with numbers

    you have a range of numbers x = <x0, x1> and a probability function probability(x) determined by your probability distribution, so do this:

     int i, j, n;
     double x, q;
     // the j*q offset stays below 0.1*stepx, so all buffer entries are unique
     for (i = 0, x = x0; x <= x1; x += stepx)
         for (j = 0, n = probability(x) * MAX, q = 0.1 * stepx / n; j < n; j++, i++)
             buff[i] = x + (double)j * q; // [edit1] unique pseudo-random numbers

    stepx is your precision for the elements (for integer types = 1). Now the buff[] array has the distribution you need, but it is not pseudo-random yet. You should also add a check that i does not reach MAX, to avoid overflowing the array; note that at the end the actual used size of buff[] is i (it can be less than MAX due to rounding).

  • shuffle buff[]

    just run a few swap loops exchanging buff[i] and buff[j], where i is the loop variable and j is pseudo-random in <0, MAX)

  • write your pseudo-random function

    it just returns numbers from the buffer: the first call returns buff[0], the second buff[1], and so on. Standard generators shuffle buff[] again when they hit its end and start over at buff[0]. But since you need unique numbers, you must not reach the end of the buffer, so set MAX large enough for your task, otherwise uniqueness is not guaranteed. (A Java sketch of the shuffle and draw steps follows this list.)
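For concreteness, a small Java rendering of the shuffle-and-draw steps (my own sketch; it assumes the buffer has already been filled as described above):

    import java.util.Random;

    class BufferedSampler {
        private final double[] buff; // pre-filled with the desired distribution
        private final Random rnd = new Random();
        private int next = 0;

        BufferedSampler(double[] buff) {
            this.buff = buff;
            shuffle();
        }

        private void shuffle() { // swap every slot with a pseudo-random one
            for (int i = 0; i < buff.length; i++) {
                int j = rnd.nextInt(buff.length);
                double tmp = buff[i]; buff[i] = buff[j]; buff[j] = tmp;
            }
        }

        double nextSample() { // sequential draw; values stay unique until exhausted
            return buff[next++];
        }
    }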

[Note]

MAX should be large enough to store the entire distribution you want. If it is not large enough, elements with a low probability may be completely absent.

[edit1] - answer slightly adjusted to fit the needs of the question (thanks to meriton for pointing it out)

PS: the initialization complexity is O(N), and getting a number is O(1).

0

I think you have two problems:

  • Your itemDistribution does not know that you need a set, so when the set you are building becomes large, you will keep picking elements that are already in it. If you instead start with the full set and remove picked items, you run into the same problem for very small target sets.

    Is there a reason why you are not removing an item from itemDistribution after you pick it? Then you would never select the same item twice.

  • Your choice of data structure for itemDistribution also looks suspicious to me. You want the getNextSample operation to be fast. Doesn't a map from items to probabilities force you to iterate over large parts of the map on every getNextSample? I don't know the statistics involved, but couldn't you represent itemDistribution the other way around, e.g. keyed by the cumulative probability (the sum of all lower probabilities plus the element's own probability)?

0

Your performance depends on how your getNextSample function works. If you have to iterate over all the probabilities each time you pick the next item, it will be slow.

A good way to select several unique random items from a list is to shuffle the list first and then pop items off it. You can shuffle the list once with the given distribution; from then on, selecting your m elements is simply taking the first m entries of the list.

A probabilistic shuffle is implemented here:

    List<Item> prob_shuffle(Map<Item, int> dist)
    {
        int n = dist.length;
        List<Item> a = dist.keys();
        int psum = 0;
        int i, j;
        for (i in dist) psum += dist[i];   // total weight of all items

        for (i = 0; i < n; i++) {
            int ip = rand(psum);           // 0 <= ip < psum
            int jp = 0;
            for (j = i; j < n; j++) {      // walk the untouched area
                jp += dist[a[j]];
                if (ip < jp) break;
            }
            psum -= dist[a[j]];            // the picked item leaves the pool
            Item tmp = a[i];               // swap it into the shuffled area
            a[i] = a[j];
            a[j] = tmp;
        }
        return a;
    }

This is not Java but pseudocode adapted from a C implementation, so take it with a grain of salt. The idea is to grow the shuffled area by repeatedly selecting items from the untouched area.
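A direct Java translation of the pseudocode (my own sketch, assuming positive integer weights):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    static <T> List<T> probShuffle(Map<T, Integer> dist, Random rnd) {
        List<T> a = new ArrayList<>(dist.keySet());
        int n = a.size();
        int psum = 0;
        for (int w : dist.values()) psum += w; // total weight of all items
        for (int i = 0; i < n; i++) {
            int ip = rnd.nextInt(psum);        // 0 <= ip < psum
            int jp = 0, j;
            for (j = i; j < n; j++) {          // walk the untouched area
                jp += dist.get(a.get(j));
                if (ip < jp) break;
            }
            psum -= dist.get(a.get(j));        // the picked item leaves the pool
            T tmp = a.get(i);                  // swap it into the shuffled area
            a.set(i, a.get(j));
            a.set(j, tmp);
        }
        return a;
    }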

Here I used integer probabilities. (You don't have to normalize the probabilities to any particular value; it's simply "bigger is more likely".) You can use floating-point numbers, but due to inaccuracies you might run past the end of the array when picking an element; in that case you should fall back to the (n - 1)-th element. If you add that safety net, you can even have elements with zero probability, which will always be picked last.

There may be a way to speed up the selection loop, but I honestly don't see how. The swapping invalidates any precomputation.

0

Accumulate your probabilities in a table:

      Item     Actual probability   Accumulated
      Item1    0.10                 0.10
      Item2    0.30                 0.40
      Item3    0.15                 0.55
      Item4    0.20                 0.75
      Item5    0.25                 1.00

Generate a random number between 0.0 and 1.0 and binary-search for the first item whose accumulated sum is larger than your generated number. That item is then selected with the desired probability.
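A small sketch of that lookup with the table above (Arrays.binarySearch returns -(insertionPoint) - 1 when the key is not found):

    import java.util.Arrays;

    double[] accumulated = {0.10, 0.40, 0.55, 0.75, 1.00};
    double r = Math.random();                 // uniform in [0, 1)
    int idx = Arrays.binarySearch(accumulated, r);
    if (idx < 0) idx = -idx - 1;              // first accumulated sum larger than r
    // idx now selects Item1..Item5 with probabilities 0.10, 0.30, 0.15, 0.20, 0.25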

0

The method Ebbe shows above is called rejection sampling.

I sometimes use a simple method based on the inverse cumulative distribution function, i.e. the function that maps a number X between 0 and 1 to a value Y of the distribution. You then simply generate a uniformly distributed random number between 0 and 1 and apply the function to it. The function is also called the "quantile function".

For example, suppose you want to generate a normally distributed random number. The cumulative distribution function is called Phi, and its inverse is called the probit function. There are many ways to generate normal variates, and this is just one of them.

You can easily tabulate an approximate cumulative distribution function for any univariate distribution you like. Then you can invert it simply by searching the table and interpolating.
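A minimal sketch of that table approach (my own illustration; xs holds the sample points and cdf the tabulated cumulative probabilities):

    import java.util.Arrays;

    class TableQuantile {
        private final double[] xs;  // sample points of the distribution, ascending
        private final double[] cdf; // cdf[i] = P(X <= xs[i]), nondecreasing, last = 1.0

        TableQuantile(double[] xs, double[] cdf) {
            this.xs = xs;
            this.cdf = cdf;
        }

        double invert(double u) { // u uniform in [0, 1)
            int i = Arrays.binarySearch(cdf, u);
            if (i >= 0) return xs[i];              // exact hit in the table
            i = -i - 1;                            // first index with cdf[i] > u
            if (i == 0) return xs[0];
            double t = (u - cdf[i - 1]) / (cdf[i] - cdf[i - 1]);
            return xs[i - 1] + t * (xs[i] - xs[i - 1]); // linear interpolation
        }
    }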

0
