Drawing cards from a deck in SciPy using scipy.stats.hypergeom

Question

Drawing cards from a deck in SciPy using scipy.stats.hypergeom

I'm having trouble understanding the documentation for the SciPy functions scipy.stats.hypergeom . In my program, I look at different decks of cards and try to find the probability of different draws. The hypergeom class seems to have just that, but its documentation involves a bunch of terminological knowledge that I don’t have. Googling leads me to Wikipedia and Wolfram MathWorld, both of which suggest that if you ask about it, you all read from dang Principia Mathematica forward and just need a little update - so they are not really useful, because this problem is in, "how do I apply this particular piece of code to my problem?" I ask for a stack overflow.

I have a problem with the form: "If you have a deck of N cards, M of which is an interest card, what are the chances of having at least 1 copy of the interest card in the top Q-cards?" I also have a problem with the form "if you have a deck of N cards whose M is an interest card, how many cards should you make from a deck in order to have a 90% chance that one of them is a copy of an interest card?" The first problem is very close to the approximate problem posed in the SciPy documentation, but it’s not the same thing, and the list of methods is all the jargon for me - I can’t say which one is the one I need. I also cannot determine which method to use for the latter type of problem.

What do scipy.stats.hypergeom methods really do, what are their arguments and how can I apply them to my problems? Pretend I'm a moderately bright high school student, not a PhD in mathematics.

+4

python scipy probability

Brighid mcdonnell Jul 01 '12 at 23:34

source share

2 answers

It is worth noting that this is not difficult to solve without using scipy at all. Let's say we have a random permutation of 10 elements:

 4, 7, 2, 3, 0, 9, 1, 5, 6, 8

And a set of "winners" 2, 4, 6 . Since all we care about are winners and losers, we can simplify our view a bit:

 1, 0, 1, 0, 0, 0, 0, 0, 1, 0

We can do the same with any set of 10 possible items and 3 winners; and with any possible permutation of 10 points, we can perform the same simplification. So what actually happens is that each permutation “selects” 3 winning indexes, and the number of possible arrangements of the winners in deck 10 selects 3 , or 10! / (3! * 7!) 10! / (3! * 7!) .

Now we need to know how many of those possible winners give us at least one winner in the first Q cards. Since it’s easier to calculate how many arrangements give us exactly zero winners in the first Q cards, we will calculate this instead. So we want, in the most specific terms, the number of sequences that look like this (for Q = 4):

 0, 0, 0, 0 | 0, 1, 0, 1, 1, 0

Here we divided the sequence, and the values preceding the section should always be zero. How many such sequences exist? Well, such sequences are exactly the same as sequences containing 3 winners in 6 cards. So 6 select 3, i.e. 6! / (3! * 3!) 6! / (3! * 3!) .

So, to get the chances that with a random permutation of 10 values, the first three do not contain a winner, we simply calculate the following:

 (6 choose 3) / (10 choose 3)

And to get the inverse coefficients (i.e. the probability that at least one of the first three contains a winner), we do the following:

 1 - (6 choose 3) / (10 choose 3)

Summarizing, with total = N , winners = M and tries = Q :

 1 - ((N - Q) chose M) / (N chose M)

In python, it looks like this:

 >>> def choose(n, x): ... return reduce(mul, range(n, n - x, -1)) / math.factorial(x) ... >>> def ntries_win_odds(total, winners, tries): ... inv = (choose(total - tries, winners) / float(choose(total, winners))) ... return 1 - inv

Deciding in the opposite direction is not too complicated - we just need a "reverse selection" function that solves c = n choose x for N given c and x . Where I want, there is an algorithmic improvement, but this works:

 >>> def choose_x_pseudoinverse(target, x): ... for n in itertools.count(start=x): ... if choose(n, x) >= target: ... return n

Now the solution for trying:

 odds = 1 - ((total - tries) chose winners) / (total chose winners) (1 - odds) * (total choose winners) = ((total - tries) chose winners) choose_x_inv((1 - odds) * (total choose winners), winners) = total - tries tries = total - choose_x_inv((1 - odds) & (total choose winners), winners)

In Python, this is

 def ntries_from_odds(odds, total, winners): inv_odds = 1 - odds tCw = choose(total, winners) return total - choose_x_pseudoinverse(inv_odds * tCw, winners)

+2

senderle Jul 02 '12 at 2:55

source share

Hugh bothwell · Accepted Answer · 2012-07-02T01:57:44+0000

 scipy.stats.hypergeom.pmf(k, M, n, N)

returns the probability that: from M-cards, n of which are marked, if you randomly select N cards without replacement, k cards will be precisely marked.

So, you can get the desired answer (using your variable names) on

 def pick_Q(N, M, Q): """ Given a deck of N cards, where M are marked, and Q cards are taken randomly without replacement, return the probability that at least one marked card is taken. """ return sum(scipy.stats.hypergeom.pmf(k, N, M, Q) for k in xrange(1,Q+1))

(the sum of the coefficients marked with 1 card, 2 cards are marked, 3 cards are marked ... N cards are marked).

Fortunately, there is a faster way — the probability that at least one marked card will be drawn is the flip side of the probability that no marked card will be selected. So instead you can do

 def pick_Q(N, M, Q): """ Given a deck of N cards, where M are marked, and Q cards are taken randomly without replacement, return the probability that at least one marked card is taken. """ return 1. - scipy.stats.hypergeom.pmf(0, N, M, Q)

For your second question, there are no functions that do what you want; however you can start with

 def how_many_to_pick(N, M, prob): """ Given a deck of N cards, M of which are marked, how many do you have to pick randomly without replacement to have at least prob probability of picking at least one marked card? """ for q in xrange(1, M+1): if pick_Q(N, M, q) >= prob: return q raise ValueError("Could not find a value for q")

Edit:

 scipy.stats.hypergeom.cdf(k, M, n, N)

Given a deck of M-cards, n of which are marked, select N randomly without replacement, find the odds chosen by k or fewer marked cards. (You can think of it as an integral from .pmf)

Then .sf (k, M, n, N) is the reverse side .cdf are the coefficients that were selected by more than k marked cards.

For instance,

  k pmf(k,52,13,4) cdf(k,52,13,4) sf(k,52,13,4) (exactly k picked) ( <= k picked) ( > k picked) --- ----------------- --------------- -------------- 0 0.303817527 0.303817527 0.696182473 1 0.438847539 0.742665066 0.257334934 2 0.213493397 0.956158463 0.043841537 3 0.041200480 0.997358944 0.002641056 4 0.002641056 1.000000000 0.000000000

Edit2:

in fact, it gives a different way to write the pick_Q function - “selecting 1 or more marked cards” can be rephrased as “selecting more than 0 marked cards”, therefore

 def pick_Q(N, M, Q): """ Given a deck of N cards, where M are marked, and Q cards are taken randomly without replacement, return the probability that at least one marked card is taken. """ return scipy.stats.hypergeom.sf(0, N, M, Q)

Drawing cards from a deck in SciPy using scipy.stats.hypergeom

More articles: