How to effectively evaluate probability based on a small amount of evidence?

I have been trying to find an answer to this for several months (for use in a machine learning application). It doesn't seem like a terribly difficult problem, but I am a software engineer and mathematics has never been my forte.

Here is the scenario:

I have a (possibly) unfairly weighted coin, and I want to find the probability that it comes up heads. I know that coins from the same box as this one have an average probability p of coming up heads, and I also know the standard deviation of these probabilities (call it s).

(If other aggregate properties of the probabilities of the other coins, besides their average and stddev, would be useful, I can probably get those too.)

I flip the coin n times, and it comes up heads h times.

The naive estimate of the probability is h / n, but if n is small, this is unlikely to be accurate.

Is there a computationally efficient way (i.e., one that does not involve very large or very small numbers) to take p and s into account and come up with a more accurate estimate of the probability, even when n is small?

I would appreciate it if any answers could use pseudo-code rather than mathematical notation, since I find most mathematical notation impenetrable ;-)


Other answers: There are other questions on SO that are similar, but the answers provided are unsatisfactory. For example, this one is not computationally efficient, since it quickly involves numbers smaller than can be represented even in double-precision floats. And this one turned out to be incorrect.

+4
5 answers

You can use p as a prior on your estimated probability. This is basically the same as pseudocount smoothing. I.e., use

 (h + c * p) / (n + c) 

as your estimate. When h and n are large, this just becomes h / n. When h and n are small, it is simply c * p / c = p. The choice of c is up to you. You could base it on s, but in the end you have to decide how small is too small.
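
For concreteness, here is a minimal Python sketch of that estimate (the function name and the example numbers are mine; c is left as a tuning knob you choose yourself):

    def smoothed_estimate(h, n, p, c):
        """Estimate the heads probability from h heads in n flips,
        shrunk toward the box average p with pseudocount weight c."""
        return (h + c * p) / (n + c)

    # Example: 3 heads in 4 flips, box average 0.5, pseudocount 10
    # gives (3 + 10 * 0.5) / (4 + 10) = 0.571 instead of the raw 0.75.
    print(smoothed_estimate(3, 4, 0.5, 10))

Larger c trusts the box average more; c = 0 gives back the naive h / n.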

+2

Unfortunately, you cannot do machine learning without knowing some basic math. It's like asking someone for help with programming while refusing to hear about "variables", "subroutines", and all of that.

The best way to do this is called Bayesian integration, but there is a simpler approximation called "maximum a posteriori" (MAP). It is pretty much like your usual reasoning, except that you get to use a prior distribution.

Fancy words, but you may ask: where does the formula h / (h + t) come from? It is obvious, of course, but it turns out that it is the answer you get when you have no prior at all. The method below is the next level of sophistication, where you add a prior. Going all the way to Bayesian integration would be the step after that, but it is harder and possibly unnecessary.

As I understand it, the problem is two-fold: first you draw a coin from a bag of coins. This coin has a "headsiness" called theta, such that it comes up heads a fraction theta of its flips. But the theta for this coin comes from an overall distribution, which I will take to be Gaussian with mean P and standard deviation S.

Next, you write down the total unnormalized probability (called the likelihood) of seeing the whole shebang, all the data: (h heads, t tails)

L = theta^h * (1 - theta)^t * Gaussian(theta; P, S)

Gaussian(theta; P, S) = exp(-(theta - P)^2 / (2 * S^2)) / sqrt(2 * pi * S^2)

This is the probability of "first drawing one value of theta from the Gaussian, then drawing h heads and t tails from a coin with that theta."

The MAP principle says: if you don't know theta, find the value that maximizes L given the data you do know. You do this with calculus. The trick that makes it easy is to take logarithms first. Define LL = log(L). Wherever L is maximized, LL is too.

so LL = h * log(theta) + t * log(1 - theta) - (theta - P)^2 / (2 * S^2) - (1/2) * log(2 * pi * S^2)

By the usual calculus for finding extrema, you look for the theta value such that dLL/dtheta = 0. Since the last log term does not contain theta, you can ignore it.

dLL/dtheta = (h / theta) + (P - theta) / S^2 - (t / (1 - theta)) = 0

If you can solve this equation for theta, you have your answer: the MAP estimate of theta, given the number of heads h and the number of tails t.

If you want a fast approximation, try one step of Newton's method, starting the proposed theta at the obvious (so-called maximum likelihood) estimate theta = h / (h + t).

And where does that "obvious" estimate come from? If you do the same as above but leave out the Gaussian prior, i.e. solve h / theta - t / (1 - theta) = 0, you come up with theta = h / (h + t).
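
To make this concrete, here is a rough Python sketch of the one-step Newton recipe above. It is only a sketch under this answer's assumptions (a Gaussian prior with mean P and stddev S); the function names and the clamping of theta away from 0 and 1, added to avoid division by zero, are my own.

    def map_estimate_one_newton_step(h, t, P, S):
        eps = 1e-6

        def dll(theta):
            # dLL/dtheta = h/theta + (P - theta)/S^2 - t/(1 - theta)
            return h / theta + (P - theta) / S**2 - t / (1 - theta)

        def dll_prime(theta):
            # derivative of dLL/dtheta, needed for the Newton step
            return -h / theta**2 - 1 / S**2 - t / (1 - theta)**2

        # start from the maximum-likelihood estimate, nudged off the boundary
        theta = min(max(h / (h + t), eps), 1 - eps)
        # one Newton step toward the root of dLL/dtheta = 0
        theta = theta - dll(theta) / dll_prime(theta)
        return min(max(theta, eps), 1 - eps)

    # Example: 3 heads, 1 tail, box average P = 0.5, stddev S = 0.1
    print(map_estimate_one_newton_step(3, 1, 0.5, 0.1))  # about 0.54, pulled toward 0.5

With lots of data the data terms dominate and the estimate approaches h / (h + t); with little data the prior term dominates and it stays near P.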

If your prior probabilities are really small, as often happens, instead of near 0.5, then a Gaussian prior on theta probably doesn't fit well, because it puts some weight on negative probabilities, which is clearly wrong. More appropriate is a Gaussian prior on log theta (a "lognormal distribution"). Plug it in the same way and work through the calculus.

+3

You do not have enough information in this question.

How many coins are in the box? If there are two, then in some scenarios (for example, one coin always lands heads, the other always tails) knowing p and s would be useful. If there are more than a few, and especially if only some of the coins are only slightly weighted, it is not very useful.

How small is n? 2? 5? 10? 100? How biased is a weighted coin toward heads/tails: 100/0, 60/40, 50.00001/49.99999? How is the weighting distributed? Is each coin one of two possible weightings? Do they follow a bell curve? Etc.

It boils down to this: the difference between a weighted and an unweighted coin, the distribution of the weighted coins, and the number of coins in your box determine how many tosses you need before you can tell them apart with a high degree of confidence.

The name for what you are trying to do is a Bernoulli trial. Knowing the name should help you find better resources.


Reply to comment:

If the differences in bias around p are small, you will have to do a lot of trials to detect them; there is no way around it.

Assuming a uniform bias distribution, p will still be 0.5, and all the standard deviation tells you is that at least some of the coins have a slight bias.

How many tosses you need will, again, be determined by how the coins are weighted. Even with 500 tosses you will not get strong confidence (only about 2/3) of detecting a .51/.49 split.
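
To get a feel for that number, here is a small Monte Carlo sketch (my own addition, not part of the answer). One simple way to read the 2/3 figure: it estimates how often 500 flips of a 0.51-biased coin actually produce a majority of heads, i.e. how often a bare majority vote calls the .51/.49 split correctly.

    import random

    def detection_rate(bias=0.51, flips=500, trials=10000):
        """Fraction of simulated runs in which a biased coin shows a heads majority."""
        wins = 0
        for _ in range(trials):
            heads = sum(random.random() < bias for _ in range(flips))
            if heads > flips // 2:
                wins += 1
        return wins / trials

    print(detection_rate())  # typically comes out near 2/3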

+2

In general, what you are looking for is Maximum Likelihood Estimation. There is a Wolfram Demonstrations Project that illustrates estimating the likelihood of a coin landing heads, given a sample of tosses.

+1

Well, I'm not a mathematician, but I think the plain Bayesian approach is intuitive and broadly applicable enough to be worth putting a little thought into. Others above have already suggested it, but perhaps, like me, you prefer more verbosity. In this jargon, you have a set of mutually exclusive hypotheses H and some data D, and you want to find the (posterior) probability that each hypothesis Hi is correct given the data. Presumably you would choose the hypothesis with the highest posterior probability (the MAP, as noted above), if you have to pick one. As Matt noted above, what distinguishes the Bayesian approach from maximum likelihood (finding the H that maximizes Pr(D | H)) is that you also have some PRIOR information about which hypotheses are most likely, and you want to incorporate those priors.

So you have the posterior probability Pr(H | D) = Pr(D | H) * Pr(H) / Pr(D). You can evaluate these Pr(Hi | D) numerically by creating a series of discrete probabilities Hi for each hypothesis you want to test, e.g. [0.0, 0.05, 0.1, ..., 0.95, 1.0], and then determining your prior Pr(Hi) for each Hi. The above assumes you have a normal prior distribution; if that is acceptable, you can use the mean and stdev to get each Pr(Hi), or use a different distribution if you prefer. With coins, Pr(D | Hi) is of course given by the binomial, using the observed number of successes in n trials and the particular Hi being tested. The denominator Pr(D) may seem daunting, but we assume we have covered all the bases with our hypotheses, so Pr(D) is just the sum of Pr(D | Hi) * Pr(Hi) over all Hi.
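
A hedged Python sketch of that grid computation might look like the following. The grid size, the helper names, and the omission of the binomial coefficient (which cancels in the normalization) are my own choices; for very large n you would want to switch to log-space arithmetic to avoid underflow.

    import math

    def posterior_over_grid(h, n, p, s, grid_size=101):
        # candidate hypotheses Hi: theta = 0.00, 0.01, ..., 1.00
        thetas = [i / (grid_size - 1) for i in range(grid_size)]
        # unnormalized Pr(Hi) * Pr(D | Hi) for each hypothesized theta
        weights = [
            math.exp(-(th - p) ** 2 / (2 * s ** 2))   # normal prior built from p and s
            * th ** h * (1 - th) ** (n - h)           # binomial likelihood (constant factor dropped)
            for th in thetas
        ]
        total = sum(weights)                          # plays the role of Pr(D)
        return thetas, [w / total for w in weights]

    def summarize(h, n, p, s):
        thetas, post = posterior_over_grid(h, n, p, s)
        map_theta = max(zip(post, thetas))[1]                     # highest-posterior hypothesis
        mean_theta = sum(th * w for th, w in zip(thetas, post))   # posterior mean
        return map_theta, mean_theta

    # Example: 3 heads in 4 flips, box average 0.5, stddev 0.1
    print(summarize(3, 4, 0.5, 0.1))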

It is all very simple if you think about it a little, and perhaps not so simple if you think about it a little more.

+1
