Implementing the Naive Bayes algorithm in Java - need some guidance

As a school assignment, I need to implement the Naive Bayes algorithm, which I intend to do in Java.

In an attempt to understand how this is done, I've read the book "Data Mining: Practical Machine Learning Tools and Techniques", which has a section on this topic, but I'm still unsure about some key points that are blocking my progress.

Since I'm looking for guidance rather than a solution, I'll tell you what I have in mind and what I think is the right approach, and in return ask for the correction/guidance that I would really appreciate. Please note that I am an absolute newbie to the Naive Bayes algorithm, data mining, and programming in general, so you may see silly comments/calculations below:

The training data set I was given has 4 attributes/features that are numeric and normalized (to the range [0, 1]) using Weka (no missing values), plus one nominal class (yes/no).

1) The data coming from the csv file is numeric. Hence:

  • Given that the attributes are numeric, I use the PDF (probability density function) formula.
  • To compute the PDF in Java, I first split the attribute values according to whether they belong to class yes or class no, and hold them in separate arrays (an array for class yes and an array for class no).
  • Then I calculate the mean (sum of the values in a row / number of values in that row) and the standard deviation for each of the 4 attributes (columns) of each class.
  • Now, to find the PDF of a given value (n), I do (n-mean)^2/(2*SD^2).
  • Then, to find P(yes | E) and P(no | E), I multiply the PDF values of all 4 given attributes and compare which is larger, which indicates the class it belongs to.
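
For reference, here is a minimal sketch of the per-class statistics and the normal density that the steps above describe. Note that (n-mean)^2/(2*SD^2) on its own is only the magnitude of the exponent; the full normal density is exp(-(n-mean)^2/(2*SD^2)) / (SD*sqrt(2*pi)). The class and method names are hypothetical, and using the sample standard deviation (dividing by n-1) is an assumption, not something stated in the question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: statistics for ONE attribute column of ONE class.
final class GaussianDensity {

    /** Arithmetic mean of the values. */
    static double mean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Sample standard deviation (assumption: divide by n - 1). Needs at least 2 values. */
    static double standardDeviation(List<Double> values, double mean) {
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumOfSquares / (values.size() - 1));
    }

    /** Normal density f(x) for the given mean and standard deviation (sd must be > 0). */
    static double density(double x, double mean, double sd) {
        double exponent = -((x - mean) * (x - mean)) / (2.0 * sd * sd);
        return Math.exp(exponent) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    public static void main(String[] args) {
        // Made-up values for one attribute within one class.
        List<Double> column = Arrays.asList(0.2, 0.4, 0.6);
        double m = mean(column);
        double sd = standardDeviation(column, m);
        System.out.println("mean=" + m + ", sd=" + sd + ", f(0.5)=" + density(0.5, m, sd));
    }
}
```

A density is not a probability and can be greater than 1 when SD is small, which is fine here because only the comparison between the two classes matters. A column whose values are all identical gives SD = 0 and would need special handling.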

In Java terms, I use an ArrayList of ArrayLists of Double to store the attribute values.

Finally, I'm not sure how to get the new data to classify. Should I ask for an input file (e.g. csv), or prompt on the command line for the 4 values?

I will stop here for now (I do have more questions), but I worry this won't get any answers given how long it has become. I will be very grateful to anyone who takes the time to read through my problems and comments.

java algorithm data-mining
1 answer

What you are doing is almost right.

  Then, to find P(yes | E) and P(no | E), I multiply the PDF values of all 4 given attributes and compare which is larger, which indicates the class it belongs to.

Here you forgot to multiply by the prior, P(yes) or P(no). Remember the decision formula:

 P(Yes | E) ~= P(Attr_1 | Yes) * P(Attr_2 | Yes) * P(Attr_3 | Yes) * P(Attr_4 | Yes) * P(Yes) 
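
A minimal Java sketch of this formula for the two-class case, assuming each P(Attr_i | class) is a normal density and the prior is estimated as the fraction of training rows in that class. All names and numbers below are hypothetical and only illustrate the shape of the computation:

```java
// Hypothetical sketch: prior times the product of per-attribute densities.
final class NaiveBayesDecision {

    /** Normal density, used here as the estimate of P(Attr_i | class). */
    static double gaussian(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    /** Unnormalised score: P(class) * f(x_1 | class) * ... * f(x_k | class). */
    static double score(double[] x, double[] means, double[] sds, double prior) {
        double result = prior;
        for (int i = 0; i < x.length; i++) {
            result *= gaussian(x[i], means[i], sds[i]);
        }
        return result;
    }

    /** Picks the class with the larger score (ties go to "yes" by assumption). */
    static String classify(double scoreYes, double scoreNo) {
        return scoreYes >= scoreNo ? "yes" : "no";
    }

    public static void main(String[] args) {
        double[] x = {0.7, 0.6, 0.8, 0.5};           // instance to classify (made up)
        double[] meansYes = {0.6, 0.6, 0.7, 0.5};    // per-attribute stats for class yes (made up)
        double[] sdsYes = {0.2, 0.1, 0.2, 0.1};
        double[] meansNo = {0.3, 0.4, 0.2, 0.5};     // per-attribute stats for class no (made up)
        double[] sdsNo = {0.2, 0.2, 0.1, 0.2};
        double priorYes = 9.0 / 14.0;                // e.g. 9 of 14 training rows are yes
        double priorNo = 5.0 / 14.0;

        double scoreYes = score(x, meansYes, sdsYes, priorYes);
        double scoreNo = score(x, meansNo, sdsNo, priorNo);
        // Dividing by the sum turns the two scores into probabilities that add up to 1.
        double posteriorYes = scoreYes / (scoreYes + scoreNo);
        System.out.println(classify(scoreYes, scoreNo) + " with P(yes | E) = " + posteriorYes);
    }
}
```

With only 4 attributes the plain product is fine; with many attributes it is common to add up logarithms of the terms instead, so that the product does not underflow to zero.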

For Naive Bayes (and any other supervised learning/classification algorithm) you need training data and test data. You use the training data to train the model and then predict on the test data. You can simply reuse the training data as test data. Or you can split the csv file into two parts, one for training and one for testing. You can also run cross-validation on the csv file.
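
As a rough illustration of the "split the csv file into two parts" option, here is a small sketch of a shuffled hold-out split plus an accuracy measure. The class name, the fixed seed, and the 70/30 ratio in `main` are assumptions made for the example; reading the csv and training the model are left out:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for evaluating a classifier on held-out rows.
final class HoldOutSplit {

    /** Shuffles a copy of the rows and returns two lists: [training rows, test rows]. */
    static <T> List<List<T>> split(List<T> rows, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split repeatable
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    /** Fraction of test rows whose predicted label equals the actual label. */
    static double accuracy(List<String> predicted, List<String> actual) {
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (predicted.get(i).equals(actual.get(i))) {
                correct++;
            }
        }
        return (double) correct / actual.size();
    }

    public static void main(String[] args) {
        // Row indices stand in for the parsed csv rows.
        List<Integer> rows = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        List<List<Integer>> parts = split(rows, 0.7, 42L);
        System.out.println("train=" + parts.get(0) + " test=" + parts.get(1));
    }
}
```

Testing on the training data itself tends to give an optimistic accuracy, which is why a held-out part or cross-validation is usually preferred when there is enough data.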
