I'll answer questions in the comments. The topic is too big to cover fully in comments.

Cliff's Notes version:
The kinds of scores we are talking about are probabilities of observations. (Whether that is appropriate for what you are doing is another question.) If you assume the samples are independent, you get the "total" probability by simply multiplying all the probabilities together. But that usually produces absurdly small numbers, so instead you add the logarithms of the probabilities, which is equivalent. Bigger is better. Zero is perfect.
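For illustration only, here is a minimal Python sketch of that kind of scoring. The function name and the example probabilities are made up; the only assumption is that each number is the probability the model assigned to the outcome that actually occurred, and that the samples are independent.

```python
import math

def log_likelihood_score(probabilities):
    """Sum of log-probabilities of the observed events.

    `probabilities` holds the probability the model assigned to each
    observed outcome. Larger (closer to zero) is better; a score of
    zero would mean every observation was predicted with probability 1.
    """
    return sum(math.log(p) for p in probabilities)

# Multiplying the raw probabilities gives ~0.479; adding the logs
# gives the equivalent but better-behaved score of ~-0.74.
probs = [0.9, 0.8, 0.95, 0.7]
print(log_likelihood_score(probs))
```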
The ubiquitous squared-error score, -x^2, where x is the model error, is based on the (often unwarranted) assumption that the training data consists of observations (measurements) corrupted by "Gaussian noise." If you look up the definition of the Gaussian (a.k.a. normal) distribution on Wikipedia or wherever, you will find that it contains the term e^(-x^2). Take the natural logarithm of that and voilà: -x^2. But your models do not produce most-likely values of the "pre-noise" measurements. They produce probabilities directly. So the thing to do is simply add the logarithms of the probabilities assigned to the observed events. Those observations are assumed to be noise-free. If the training data says it happened, it happened.
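To see where the -x^2 score comes from, here is a small sketch under the assumptions stated above: zero-mean Gaussian noise with standard deviation sigma (the function name is invented for the example). The log of the Gaussian density at error x is -x^2/(2*sigma^2) plus a constant that does not depend on x, so maximizing that log-likelihood is the same as minimizing the squared error.

```python
import math

def gaussian_log_density(x, sigma=1.0):
    """Log of the normal density at error x (zero mean, std dev sigma).

    Equals -x**2 / (2 * sigma**2) minus the constant
    log(sigma * sqrt(2*pi)), which does not depend on x.
    """
    return -0.5 * (x / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

# The gap between the two columns is the same constant for every x,
# which is why "maximize Gaussian log-likelihood" reduces to "minimize x^2".
for x in (0.0, 0.5, 1.0):
    print(x, gaussian_log_density(x), -0.5 * x ** 2)
```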
Your original question remains unanswered: how do I tell whether two models differ "significantly"? That is a vague and difficult question, and the subject of much debate, even emotion and rancor. It is also not really the question you want answered. What you want to know is which model gives you the best expected profit, all things considered, including how much each software package costs, and so on.
I need to stop soon. This is not the place for a course on modeling and likelihood, and I am not really qualified to be the professor anyway.
Jive dadson