Pearl programming - random selection algorithm

Question

Pearl programming - random selection algorithm

Page 120 of Programming Pearls 1st edition presents this algorithm for selecting M equiprobable random elements from a collection of N integers.

InitToEmpty Size := 0 While Size < M do T := RandInt(1,N) if not Member(T) Insert(T) Size := Size + 1

It is stated that the expected number of member tests is less than 2M if M <N / 2.

I would like to know how to prove this, but my analysis of the algorithms does not allow me.

I understand that the closer M is to N, the longer the program will run, because there will be more elements in the result set, and the probability that RandInt will choose an existing one will increase proportionally.

Can you help me deal with this evidence?

+6

algorithm

mpm Jul 19 '10 at 16:12

source share

2 answers

Suppose that we selected K elements from N. Then our next attempt has a probability of (NK) / N of the next, so the number of attempts that is required to find K + 1 of the element is geometrically distributed with the average N / (NK).

So, if 2M <N, we expect him to take less than two attempts to get each element.

0

deinst Jul 19 '10 at 16:28

source share

Bill prin · Accepted Answer · 2010-07-19T16:28:12+0000

I'm not a math wizard, but I will give him a rude shot. This is NOT guaranteed, though.

For each additional member M you choose a number, see if it is there, and if it adds it. Otherwise, try again. Trying something until you succeed is called a distribution of geometric probabilities.

http://en.wikipedia.org/wiki/Geometric_distribution

So, you are doing M geometric tests. Each test has an expected value of 1 / p, so it is expected that 1 / p will try to get a number not yet in M. p equal to N minus the number of numbers that we have already added from M, divided by N (i.e. unlimited elements / total number of items). Thus, for the fourth number p = (N -3) / N, which is the probability of choosing an unused number, so the expected number of samples for the third number is N / N-3.

The expected value of the runtime is all added together. So something like

E (lead time) = N / N + N / (N -1) + N / (N -2) ... + N / (NM)

Now, if M <N / 2, then the last element of this summation is bounded above by 2. ((N / N / 2) == 2)). This is also, obviously, the biggest element in the whole summation. So if the largest element is two choices, and there are M elements that add up, EV of the entire runtime is limited to 2M higher.

Ask me if this is unclear. Correct me if it is not so :)

Pearl programming - random selection algorithm

More articles: