Naive Bayes Classifier and Discriminant Analysis Accuracy

So, I have two classification methods: discriminant analysis with a diagonal covariance estimate (the 'diaglinear' option, which is essentially naive Bayes) and the pure Naive Bayes classifier implemented in MATLAB. There are 23 classes in the entire dataset. The first method, discriminant analysis:

    %% Classify using Naive Bayes Classifier (diaglinear discriminant analysis)
    training_data = Testdata;
    target_class = TestDataLabels;

    [class, err] = classify(UnseenTestdata, training_data, target_class, 'diaglinear');

    cmat1 = confusionmat(UnseenTestDataLabels, class);
    acc1 = 100*sum(diag(cmat1))./sum(cmat1(:));
    fprintf('Classifier1:\naccuracy = %.2f%%\n', acc1);
    fprintf('Confusion Matrix:\n'), disp(cmat1)

This obtains an accuracy of 81.49% from the confusion matrix, with an error rate (err) of 0.5040 (I'm not sure how to interpret this).

Second Naive Bayes Classifier Method:

    %% Classify using Naive Bayes Classifier
    training_data = Testdata;
    target_class = TestDataLabels;

    %# train model
    nb = NaiveBayes.fit(training_data, target_class, 'Distribution', 'mn');

    %# prediction
    class1 = nb.predict(UnseenTestdata);

    %# performance
    cmat1 = confusionmat(UnseenTestDataLabels, class1);
    acc1 = 100*sum(diag(cmat1))./sum(cmat1(:));
    fprintf('Classifier1:\naccuracy = %.2f%%\n', acc1);
    fprintf('Confusion Matrix:\n'), disp(cmat1)

This reaches an accuracy of 81.89%.

I have only done one round of cross-validation. I'm new to MATLAB and to supervised/unsupervised algorithms, so I did the cross-validation myself: I basically take 10% of the data and set it aside for testing purposes. Since it is a random split each time, I could run it several times and take the average accuracy, but these single-run results will do for the purpose of explanation.
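For reference, here is a minimal sketch of the kind of hold-out split I mean (fulldata and fulllabels are placeholder names for the complete dataset and its labels; they are not used elsewhere in this post):

    % Hold out a random 10% of the rows for testing (sketch only)
    n = size(fulldata, 1);
    idx = randperm(n);                      % random ordering of all rows
    nTest = round(0.1 * n);

    UnseenTestdata       = fulldata(idx(1:nTest), :);
    UnseenTestDataLabels = fulllabels(idx(1:nTest), :);
    Testdata             = fulldata(idx(nTest+1:end), :);    % used for training
    TestDataLabels       = fulllabels(idx(nTest+1:end), :);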

So, on to my question.

In my literature review of existing methods, many researchers have found that combining a single classification algorithm with a clustering algorithm gives better accuracy. They do this by finding the optimal number of clusters for their data, splitting the data into those clusters (whose members should be more similar to each other), and then running a classification algorithm on each individual cluster. This is a process in which you can combine the strengths of an unsupervised algorithm with a supervised classification algorithm.

I am using a dataset that has been used repeatedly in the literature, and I am attempting an approach that is not too far from existing ones, so that my results are easy to compare.

First I use simple K-means clustering, which turns out to separate my data surprisingly well. The result looks like this:

[image: K-means clustering result]

Looking at the cluster class labels (K1, K2 ... K12):

    %% output the class labels of each cluster
    K1 = UnseenTestDataLabels(indX(clustIDX==1),:)

I find that one class dominates in 9 of the clusters, while 3 clusters contain several class labels. This shows that K-means separates the data well.
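For completeness, the same check for all twelve clusters can be done in one loop (a sketch; it assumes clustIDX, indX and UnseenTestDataLabels as used above, and that the labels are in a form tabulate accepts):

    % For each cluster, show how many instances of each class it contains
    for k = 1:12
        fprintf('Cluster %d:\n', k);
        tabulate(UnseenTestDataLabels(indX(clustIDX == k), :));
    end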

However, the problem comes with the data belonging to each cluster (cluster1, cluster2 ... cluster12):

    %% output the real data of each cluster
    cluster1 = UnseenTestdata(clustIDX==1,:)

I then put each cluster through naive Bayes or discriminant analysis as follows:

    class1 = classify(cluster1, training_data, target_class, 'diaglinear');
    class2 = classify(cluster2, training_data, target_class, 'diaglinear');
    class3 = classify(cluster3, training_data, target_class, 'diaglinear');
    class4 = classify(cluster4, training_data, target_class, 'diaglinear');
    class5 = classify(cluster5, training_data, target_class, 'diaglinear');
    class6 = classify(cluster6, training_data, target_class, 'diaglinear');
    class7 = classify(cluster7, training_data, target_class, 'diaglinear');
    class8 = classify(cluster8, training_data, target_class, 'diaglinear');
    class9 = classify(cluster9, training_data, target_class, 'diaglinear');
    class10 = classify(cluster10, training_data, target_class, 'diaglinear');
    class11 = classify(cluster11, training_data, target_class, 'diaglinear');
    class12 = classify(cluster12, training_data, target_class, 'diaglinear');
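As an aside, these twelve calls could be collapsed into a loop over cell arrays; this is only a restructuring sketch (it assumes clustIDX, UnseenTestdata, training_data and target_class as defined earlier) and behaves exactly like the explicit calls above:

    numClusters = 12;
    clusters    = cell(numClusters, 1);
    clusterPred = cell(numClusters, 1);
    for k = 1:numClusters
        clusters{k}    = UnseenTestdata(clustIDX == k, :);     % test rows in cluster k
        clusterPred{k} = classify(clusters{k}, training_data, ...
                                  target_class, 'diaglinear'); % still trained on all training_data
    end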

The accuracy becomes terrible: 50% of the clusters are classified with 0% accuracy. Each classified cluster (acc1, acc2, ... acc12) has its own confusion matrix, and you can see the accuracy of each cluster here:

[image: per-cluster accuracies acc1 ... acc12]

So my problem/question is: where am I going wrong? At first I thought maybe I had the data/labels mixed up across clusters, but what I posted above looks correct; I cannot see the problem with it.

Why does the same unseen 10% of data that was used in the first experiment give such strange results once it has been clustered? NB is meant to be a stable classifier and should not overfit easily, and seeing as the training data is vast while the clusters to be classified come from that same data, this shouldn't happen, should it?

EDIT:

As requested in the comments, I have included the cmat for the first test example, which gives 81.49% accuracy and 0.5040 error:

[image: cmat1 confusion matrix for the first test example]

Also requested was a snippet of K (the cluster labels), class, and the associated cmat for one example cluster (cluster4), which has an accuracy of 3.03%:

[image: K, class and cmat for cluster4]

Seeing as there are a large number of classes (23 in total), I decided to reduce them to the categories outlined in the 1999 KDD Cup; this is just applying a bit of domain knowledge, since some of the attacks are more similar to one another and fall under one umbrella term.
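For anyone curious, this is roughly the mapping I mean (a sketch; the exact attack-name strings depend on how the labels appear in your copy of the KDD Cup '99 data, and rawLabels is a placeholder for the original label column):

    % Map the specific attack labels plus 'normal.' onto 5 umbrella classes
    dosAttacks   = {'back.','land.','neptune.','pod.','smurf.','teardrop.'};
    probeAttacks = {'ipsweep.','nmap.','portsweep.','satan.'};
    r2lAttacks   = {'ftp_write.','guess_passwd.','imap.','multihop.','phf.','spy.','warezclient.','warezmaster.'};
    u2rAttacks   = {'buffer_overflow.','loadmodule.','perl.','rootkit.'};

    mappedLabels = cell(size(rawLabels));
    mappedLabels(ismember(rawLabels, dosAttacks))   = {'DoS'};
    mappedLabels(ismember(rawLabels, probeAttacks)) = {'Probe'};
    mappedLabels(ismember(rawLabels, r2lAttacks))   = {'R2L'};
    mappedLabels(ismember(rawLabels, u2rAttacks))   = {'U2R'};
    mappedLabels(strcmp(rawLabels, 'normal.'))      = {'normal.'};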

I then trained a classifier on 444 thousand records, holding back 10% for testing purposes.

The accuracy was worse, at 73.39%, with an error rate of 0.4261:

[image: confusion matrix for the reduced-class run]

The unseen (test) data breaks down into these classes:

    DoS:      39149
    Probe:      405
    R2L:        121
    U2R:          6
    normal.:   9721

The predicted class labels (the result of the discriminant analysis) break down as:

    DoS:      28135
    Probe:    10776
    R2L:       1102
    U2R:       1140
    normal.:   8249

Training data consists of:

    DoS:     352452
    Probe:     3717
    R2L:       1006
    U2R:         49
    normal.:  87395

I'm afraid that if I reduce the training data so that it has a similar proportion of malicious activity, the classifier won't have enough predictive power to distinguish between classes. However, looking at some other literature, I noticed that some researchers remove U2R altogether, as there is not enough data for successful classification.
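One option I am considering instead of throwing training data away: NaiveBayes.fit accepts a 'Prior' argument, so the empirical (DoS-dominated) class priors could be replaced with uniform ones. This is only an untested sketch of that idea:

    % Naive Bayes with uniform class priors instead of the empirical,
    % DoS-heavy frequencies - an experiment, not a guaranteed fix
    nbP = NaiveBayes.fit(training_data, target_class, 'Distribution', 'mn', 'Prior', 'uniform');
    classP = nbP.predict(UnseenTestdata);

    cmatP = confusionmat(UnseenTestDataLabels, classP);
    accP  = 100*sum(diag(cmatP))./sum(cmatP(:));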

The methods I have tried so far are: one-class classifiers, where I train a classifier to predict only one class (not effective); classifying individual clusters (even worse accuracy); reducing the class labels (second best); and keeping the full 23 class labels (best accuracy).

5 answers

As others have rightly pointed out, there is at least one problem here:

    class1 = classify(cluster1, training_data, target_class, 'diaglinear');
    ...

You are training the classifier on all of training_data, but evaluating it only on the sub-clusters. For clustering the data to have any effect, you need to train a different classifier within each of the sub-clusters. Sometimes this can be very difficult - for example, there may be very few (or no!) examples of class Y in cluster C. That is inherent to attempting joint clustering and learning.

The general structure of your problem is as follows:

    Training data:
        Cluster into C clusters
        Within each cluster, develop a classifier

    Testing data:
        Assign each observation to one of the C clusters (either "hard" or "soft")
        Run the classifier corresponding to that cluster

This:

 class1 = classify(cluster1, training_data, target_class, 'diaglinear'); 

does not do that.
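Very roughly, and with made-up placeholder names (testCluster1 for the test rows that fall in cluster 1, trainCluster1 and trainClusterLabels1 for the training rows and labels that fall in it), the per-cluster call should look more like:

    % train on the training rows assigned to cluster 1, and evaluate on the
    % test rows assigned to cluster 1 - not on all of training_data
    class1 = classify(testCluster1, trainCluster1, trainClusterLabels1, 'diaglinear');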


Here is a very simple example that shows how this should work and what’s wrong.

    %% Generate data and labels for each class
    x11 = bsxfun(@plus,randn(100,2),[2 2]);
    x10 = bsxfun(@plus,randn(100,2),[0 2]);
    x21 = bsxfun(@plus,randn(100,2),[-2 -2]);
    x20 = bsxfun(@plus,randn(100,2),[0 -2]);

    %If you have the PRT (shameless plug), this looks nice:
    %http://www.mathworks.com/matlabcentral/linkexchange/links/2947-pattern-recognition-toolbox
    % ds = prtDataSetClass(cat(1,x11,x21,x10,x20),prtUtilY(200,200));

    x = cat(1,x11,x21,x10,x20);
    y = cat(1,ones(200,1),zeros(200,1));

    clusterIdx = kmeans(x,2); %make 2 clusters
    xCluster1 = x(clusterIdx == 1,:);
    yCluster1 = y(clusterIdx == 1);
    xCluster2 = x(clusterIdx == 2,:);
    yCluster2 = y(clusterIdx == 2);

    %Performance is terrible:
    yOut1 = classify(xCluster1, x, y, 'diaglinear');
    yOut2 = classify(xCluster2, x, y, 'diaglinear');
    pcCluster = length(find(cat(1,yOut1,yOut2) == cat(1,yCluster1,yCluster2)))/size(y,1)

    %Performance is Good:
    yOutCluster1 = classify(xCluster1, xCluster1, yCluster1, 'diaglinear');
    yOutCluster2 = classify(xCluster2, xCluster2, yCluster2, 'diaglinear');
    pcWithinCluster = length(find(cat(1,yOutCluster1,yOutCluster2) == cat(1,yCluster1,yCluster2)))/size(y,1)

    %Performance is Bad (using all data):
    yOutFull = classify(x, x, y, 'diaglinear');
    pcFull = length(find(yOutFull == y))/size(y,1)

Looking at your cmat1 data from the first example (with an accuracy of 81.49%), the main reason you get high accuracy is that your classifier gets a large number of class 1 and class 4 instances correct. Almost all the other classes perform poorly (getting zero correct predictions). And this matches your last example (using k-means first), where for cluster7 you get an acc7 of 56.9698.
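You can see this quantitatively by pulling the per-class recall out of the confusion matrix (a sketch, assuming cmat1 as computed in the question):

    % Per-class recall: correct predictions for a class / actual instances of it
    rowTotals = sum(cmat1, 2);
    recall    = diag(cmat1) ./ max(rowTotals, 1);   % max(...,1) avoids dividing by zero
    disp([(1:size(cmat1,1))', recall, rowTotals]);  % class index, recall, support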

EDIT: It seems that in cmat1 there is no test data for more than half of the classes (look at all the all-zero rows). So you can only really know how well the classifier performs on classes like 1 and 4, and you get similar performance if you cluster first. But for the other classes, this is no proof that it works properly.


After you cluster your data, do you train a classifier for each cluster? If you don't, this may be your problem.

Try doing it this way. First, cluster your data and keep the centroids. Then, using the training data, train a classifier for each cluster. For the classification phase, find the nearest centroid to the object you want to classify and use the corresponding classifier.

A single classifier is not a good idea, because it learns the patterns of the whole data set. What you want when you cluster is to learn the local patterns that describe each cluster.


Consider this function call:

 classify(cluster1, training_data, target_class, 'diaglinear'); 

training_data is sampled from the entire feature space. What does that mean? The classification model you train will try to maximize classification accuracy over the entire feature space. This means that if you show it test samples with the same behaviour as your training data, you will get reasonable classification results.

The point is that you are not showing it test samples with the same behaviour as your training data. In fact, cluster1 is a sample from only a partition of your space. More specifically, the instances in cluster1 correspond to the samples in your space that are closer to the centroid of cluster1 than to the other centroids, and this can degrade the performance of your classifier.

Therefore, I suggest the following (a rough MATLAB sketch follows the list):

  • Cluster your training set and keep the centroids
  • Using the training data, train a classifier for each cluster. That is, use only the instances belonging to that cluster to train it.
  • For the classification stage, find the nearest centroid to the object you want to classify and use the corresponding classifier.
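Here is a rough sketch of those three steps (trainData, trainLabels and testData are placeholder names, and this is an outline of the idea rather than tested code):

    K = 12;                                        % number of clusters

    % 1) Cluster the training set and keep the centroids
    [trainClustIdx, centroids] = kmeans(trainData, K);

    % 2) + 3) For each test point, find its nearest centroid, then classify it
    %    with a model trained only on the training rows of that cluster
    %    (classify() trains and predicts in one call)
    dists = pdist2(testData, centroids);           % n_test-by-K distance matrix
    [~, testClustIdx] = min(dists, [], 2);         % nearest centroid per test row

    predicted = zeros(size(testData, 1), 1);       % assuming numeric class labels
    for k = 1:K
        trIdx = (trainClustIdx == k);
        teIdx = (testClustIdx  == k);
        if any(teIdx) && any(trIdx)                % a cluster may hold few (or no) rows
            predicted(teIdx) = classify(testData(teIdx, :), ...
                trainData(trIdx, :), trainLabels(trIdx), 'diaglinear');
        end
    end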
