Determining the significance of the difference between two error values

I evaluate a number of different algorithms whose task is to predict the probability of an event occurring.

I test the algorithms on large data sets. I measure their effectiveness using the "Root Mean Square Error" (RMSE), which is the square root of the mean of the squared errors. The error is the difference between the predicted probability (a floating-point value between 0 and 1) and the actual outcome (0.0 or 1.0).
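In code terms, the metric is essentially the following (a minimal sketch, assuming NumPy arrays p of predicted probabilities and y of 0/1 outcomes; the variable names are only illustrative):

    import numpy as np

    # p: predicted probabilities, y: actual outcomes (0.0 or 1.0)
    rmse = np.sqrt(np.mean((p - y) ** 2))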

So, I know the RMSE, as well as the number of samples the algorithm has been tested on.

The problem is that the RMSE values are sometimes very close to each other, and I need a way to determine whether the difference between them is just chance or reflects a real difference in performance.

Ideally, for a given pair of RMSE values, I would like to know the probability that one algorithm is really better than the other, so that I can use this probability as a significance threshold.

+6
statistics probability
3 answers

You are entering a vast and controversial field, not just of computing but of philosophy. Significance tests and model selection are subjects of intense disagreement between Bayesians and frequentists. Triston's comment about splitting the data set into training and test sets will not please a Bayesian.

May I suggest that RMSE is not an appropriate score for probability estimates. If the samples are independent, the correct score is the sum of the logarithms of the probabilities assigned to the actual outcomes. (If they are not independent, you have a mess on your hands.) What I am describing is scoring the plug-in model. Proper Bayesian modeling requires integrating over the model parameters, which is extremely computationally difficult. The Bayesian way of regularizing the plug-in model is to add a penalty to the score for improbable (large) model parameters. This is called "weight decay."
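A minimal sketch of that score for the binary outcomes described in the question (assuming NumPy arrays p of predicted probabilities and y of 0/1 outcomes; the names are illustrative, and the probabilities are clipped only to avoid log(0)):

    import numpy as np

    def log_probability_score(p, y, eps=1e-12):
        """Sum of log-probabilities assigned to the outcomes that actually occurred."""
        p = np.clip(p, eps, 1 - eps)  # guard against log(0)
        return np.sum(np.where(y == 1.0, np.log(p), np.log(1 - p)))

The larger (closer to zero) the total, the better the model's probabilities match what actually happened.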

I began my own journey of discovery by reading Neural Networks for Pattern Recognition by Christopher Bishop. I used it, along with Practical Optimization by Gill et al., to write software that has worked very well for me.

+4

MSE is an average, so the central limit theorem applies. Testing whether two MSEs are equal is therefore the same as testing whether two means are equal. The difficulty compared with a standard two-means comparison is that your samples are correlated: both come from the same events. But the difference in MSE is itself the mean of the per-event differences in squared error (means are linear). This suggests calculating a one-sample t-test, as follows (a code sketch appears after the list):

  • For each event x, compute the errors e1 and e2 for procedures 1 and 2.
  • Compute the differences of the squared errors, (e2^2 - e1^2).
  • Compute the mean of the differences.
  • Compute the standard deviation of the differences.
  • Compute the t-statistic as mean/(sd/sqrt(n)).
  • Compare the t-statistic with the critical value, or compute the p-value. For example, reject equality at the 5% significance level if |t| > 1.96.
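Here is a minimal sketch of that procedure (assuming NumPy arrays p1 and p2 of predicted probabilities from the two procedures and y of observed 0/1 outcomes; the names are illustrative):

    import numpy as np
    from scipy import stats

    def mse_difference_test(p1, p2, y):
        """One-sample t-test on the per-event differences of squared errors."""
        d = (p2 - y) ** 2 - (p1 - y) ** 2            # e2^2 - e1^2 for each event
        n = len(d)
        t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
        p_value = 2 * stats.t.sf(abs(t), df=n - 1)   # two-sided p-value
        return t, p_value

Reject equality at the 5% level when the p-value falls below 0.05 (equivalently, |t| > 1.96 for large n).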

RMSE is a monotonic transformation of MSE, so this test should not give substantively different results. But be careful not to treat MSE and RMSE as the same quantity.

A more serious concern is overfitting. Be sure to compute all the MSE statistics using data that you did not use to estimate your model.

+7

I am answering questions raised in the comments here; the topic is too large to handle inside the comments.

Cliff's Notes version:

The kinds of scores we are talking about measure probabilities. (Whether that is appropriate for what you are doing is another question.) If you assume the samples are independent, you get the "total" probability by simply multiplying all the individual probabilities together. But this usually yields absurdly small numbers, so, equivalently, you add the logarithms of the probabilities. Bigger is better; zero is perfect.
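To see why the logarithms matter in practice, here is a tiny illustration (plain NumPy, purely for demonstration): the product of many probabilities underflows to zero in double precision, while the sum of their logarithms stays perfectly manageable.

    import numpy as np

    p = np.full(5000, 0.5)        # 5000 independent events, each with probability 0.5
    print(np.prod(p))             # 0.0 -- underflows in double precision
    print(np.sum(np.log(p)))      # about -3465.7 -- the equivalent log score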

The ubiquitous squared-error score, -x^2, where x is the model's error, rests on the (often unjustified) assumption that the training data are observations (measurements) corrupted by "Gaussian noise." If you look up the definition of a Gaussian (a.k.a. normal) distribution on Wikipedia or elsewhere, you will find that it contains the term e^(-x^2). Take the natural logarithm of that and, voila!, -x^2. But your models do not produce most-probable "pre-noise" values for measurements; they produce probabilities directly. So you simply add the logarithms of the probabilities assigned to the observed events. These observations are assumed to be noise-free: if the training data say it happened, it happened.
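For completeness, here is the one-line derivation behind that remark, writing the Gaussian density with unit variance for simplicity:

    ln( (1/sqrt(2*pi)) * e^(-x^2/2) ) = -x^2/2 - ln(sqrt(2*pi))

so, up to an additive constant and a scale factor, the log-likelihood of a Gaussian-noise observation is just -x^2.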

Your original question remains unanswered: how do you know whether two models differ "significantly"? That is a vague and difficult question, the subject of much discussion and even emotion and rancor. It is also not really the question you want answered. What you want to know is which model gives you the best expected profit, all things considered, including how much each software package costs, and so on.

I need to stop soon. This is not the place for a course in modeling and likelihood, and I am not particularly qualified to be its professor.

0
