Naive Bayes without the Naive Assumption

I am trying to understand why the naive Bayes classifier scales linearly with the number of features, while the same idea without the naive assumption scales exponentially. I understand how the classifier works and what is so "naive" about it. What is not clear to me is why the naive assumption gives linear scaling, while dropping that assumption gives exponential scaling. I am looking for a worked example that shows the algorithm with the "naive" assumption and linear complexity, and the same example without that assumption, demonstrating the exponential complexity.

The problem here is the following quantity

P(x1, x2, x3, ..., xn | y) 

which you have to estimate. When you make the "naive" assumption (feature independence), you get

 P(x1, x2, x3, ..., xn | y) = P(x1 | y)P(x2 | y) ... P(xn | y) 

and you can estimate each P(xi | y) independently. Naturally, this approach scales linearly, because if you add k more features you only need to estimate k more probabilities, each of which can be obtained with a very simple method (for example, by counting the objects that have a given feature).
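
For concreteness, here is a minimal Python sketch of that counting step (my own illustration, not part of the original answer; the data and labels are made up): each conditional P(xi = v | y = c) is estimated on its own, so adding features only adds counters.

    from collections import defaultdict

    def estimate_naive_conditionals(X, y):
        """Estimate P(x_i = v | y = c) for every feature i independently by counting.

        X is a list of boolean feature vectors, y a list of class labels.
        Minimal sketch: no smoothing, no log-space arithmetic.
        """
        class_counts = defaultdict(int)
        counts = defaultdict(int)  # counts[(i, v, c)] = samples with feature i == v and label c
        for xs, c in zip(X, y):
            class_counts[c] += 1
            for i, v in enumerate(xs):
                counts[(i, v, c)] += 1
        # One pass over the data; adding k more features only adds k more counters.
        return {key: n / class_counts[key[2]] for key, n in counts.items()}

    X = [[True, False, True], [True, True, True], [False, False, True]]
    y = ["spam", "spam", "ham"]
    print(estimate_naive_conditionals(X, y)[(0, True, "spam")])  # P(x1 = True | y = spam) -> 1.0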

Now, without the naive assumption, you do not have any such decomposition, so you have to keep track of all probabilities of the form

 P(x1=v1, x2=v2, ..., xn=vn | y) 

for all possible values of vi. In the simplest case, vi is simply "true" or "false" (the event occurred or not), and this already gives you 2^n probabilities to estimate (one for each possible assignment of "true" and "false" to a sequence of n Boolean variables). So the complexity of the algorithm grows exponentially. However, the biggest problem here is usually not computational but rather the lack of data: since there are 2^n probabilities to estimate, you need more than 2^n data points to have any estimate at all for every possible event. In real life you will never come across a data set of 1,000,000,000,000 points... and that is roughly the number of required (unique!) points for 40 features with this approach (2^40 ≈ 10^12).
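
A small illustrative sketch of that blow-up (again my own, in Python): without the factorization you need one probability per joint configuration, and merely enumerating the configurations is already exponential in the number of features.

    from itertools import product

    def joint_table_size(n_features, n_values=2):
        # Without the naive assumption you need one probability per joint
        # configuration (v1, ..., vn), so the table has n_values ** n_features entries.
        return n_values ** n_features

    for n in (5, 10, 20, 40):
        print(n, joint_table_size(n))   # 32, 1024, ~1e6, ~1.1e12

    # For small n you can still list the joint events explicitly:
    configs = list(product([False, True], repeat=3))
    print(len(configs))                 # 8 = 2^3 events to estimate for each class y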

Candy selection

On the outskirts of Mumbai there lived an old grandmother whose quantitative outlook on life had earned her the nickname "Statistical Grandmother." She lived alone in a huge mansion, where she practiced sound statistical analysis, shielded from the barrage of hopelessly flawed biases peddled as common sense by the mass media and so-called armchair scientists.

Every year on her birthday, her whole family would visit her and stay in the mansion. Sons, daughters, their spouses, grandchildren. It was a big bash every year, with lots of fanfare. But what Grandma loved most was meeting her grandchildren and playing with them. She had ten grandchildren, all of them around 10 years old, and she fondly called them her "random variables."

Each year, Grandma would give each child a candy. She had a big box with candies of ten different kinds. She gave one candy to each child, since she did not want to spoil their teeth. But, since she loved the children so much, she took great pains to decide which candy to give to which child so as to maximize their total happiness (the maximum likelihood estimate, as she called it).

But this was not an easy task for Grandma. She knew that each type of candy had a certain probability of making a given child happy, and that probability was different for different kinds of candy and for different children. Rahkesh liked the red candy more than the green one, while Sheila liked the orange one above all.

Each of the 10 children had different preferences for each of the 10 candies.

Moreover, their preferences depended heavily on external factors that were unknown to Grandma (hidden variables).

If Samer saw a blue building on the way to the mansion, he would want a blue candy, while Sandeep always wanted the candy that matched the color of his shirt that day. But the biggest problem was that their happiness depended on what candies the other children got! If Rohan got a red candy, then Niyati would also want a red candy, and anything else would make her cry in her mother's arms (conditional dependence). Sakshi always wanted whatever candy most of the other children got (positive correlation), while Tanmay would be happiest if nobody else got the kind of candy he received (negative correlation). Grandma had concluded long ago that her grandchildren were completely interdependent.

Figuring out the right choice of candies was a computational nightmare for Grandma. There were too many conditions to account for, and she could not simplify the calculation. Every year before her birthday, she spent days working out the optimal assignment of candies, enumerating all configurations of candies for all the children together (which was an exponentially expensive task). She was getting old, and the task was getting harder and harder. She used to feel that she would die before figuring out the optimal choice of candies that would make all her children the happiest at once.

But then an interesting thing happened. As the years passed and the children grew up, they finally left their teenage years behind and turned into independent adults. Their choices became less and less dependent on each other, and it became easier to figure out which candy each of them preferred most (they all still loved candy, and Grandma).

Grandma was quick to realize this, and she happily began to call them "independent random variables." It was now much easier for her to figure out the optimal choice of candies: she just had to think about one child at a time and, for each child, assign a happiness probability to each of the 10 kinds of candy. Then she would pick the candy with the highest happiness probability for that child, without worrying about what she would assign to the other children. This was a very simple task, and Grandma was finally able to get it right.

That year, the children were finally all the happiest at once, and Grandma had a wonderful time at her birthday party. A few months later, Grandma passed away, with a smile on her face and a copy of Sheldon Ross clutched in her hand.

Takeaway: In statistical modeling, interdependent random variables make it hard to find the optimal assignment of values to the variables, the one that maximizes the joint probability of the whole set.

You need to enumerate all possible configurations (whose number grows exponentially in the number of variables). However, if the variables are independent, it is easy to pick the individual assignment that maximizes each variable's probability, and then combine those individual assignments to get a configuration for the entire set.

In Naive Bayes, you make the assumption that the variables are independent (even if they really aren't). This simplifies your calculation, and it turns out that in many cases it actually gives estimates comparable to those you would get from a (computationally) more expensive model that takes the conditional dependencies between variables into account.
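
To make that concrete, here is a small sketch (the "happiness" probabilities are invented for illustration): with dependent variables you must score every joint assignment, which grows exponentially, while with independent variables you can simply maximize each one on its own.

    import math
    from itertools import product

    # p[i][v] = probability that child i is happy with candy v (made-up numbers).
    p = [
        {"red": 0.9, "green": 0.1},
        {"red": 0.3, "green": 0.7},
        {"red": 0.6, "green": 0.4},
    ]

    # Dependent view: score every joint assignment (2^3 here, exponential in general).
    best_joint = max(
        product(["red", "green"], repeat=len(p)),
        key=lambda assign: math.prod(p[i][v] for i, v in enumerate(assign)),
    )

    # Independent view: pick the best candy for each child separately (linear).
    best_independent = tuple(max(p_i, key=p_i.get) for p_i in p)

    print(best_joint)        # ('red', 'green', 'red')
    print(best_independent)  # same here, because these toy probabilities are independent;
                             # with real conditional dependencies only the exponential
                             # enumeration is guaranteed to find the optimum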

There was no mathematics in this answer, but hopefully it has made it easier to grasp the concept behind Naive Bayes and to approach the mathematics with confidence. (The Wikipedia page is a good start: Naive Bayes.)

Why is this "naive"?

The (Gaussian) Naive Bayes classifier assumes that X | Y is normally distributed with zero covariance between any of the components of X. Since this is a completely implausible assumption for almost any real task, we call it naive.
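
As an illustrative sketch of that assumption (my own example, not part of the original answer): a Gaussian Naive Bayes model keeps only a per-class mean and variance for each feature, i.e. a diagonal covariance matrix, and multiplies the independent Gaussian likelihoods.

    import numpy as np

    def fit_gaussian_nb(X, y):
        # Zero covariance ("naive") means we store just a mean and a variance
        # per feature per class instead of a full covariance matrix.
        params = {}
        for c in np.unique(y):
            Xc = X[y == c]
            params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
        return params

    def log_posterior(params, x):
        scores = {}
        for c, (mu, var, prior) in params.items():
            # Sum of per-feature Gaussian log-likelihoods == product of independent Gaussians.
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
            scores[c] = np.log(prior) + log_lik
        return scores

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)
    scores = log_posterior(fit_gaussian_nb(X, y), np.array([2.5, 2.8]))
    print(max(scores, key=scores.get))  # expected to print 1 (the class centered at 3)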

Naive Bayes will make the following assumption:

If you like pickles and you like ice cream, Naive Bayes will assume independence and serve you pickle ice cream, expecting you to like it.

This may not be true.

For a mathematical example, see https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/
