There are two broad schools of classification:
1) Discriminative. Here we try to learn the decision boundary from the training examples. Then, depending on which part of the space the test example falls in, as determined by the decision boundary, we assign it a class. The state-of-the-art algorithm here is the SVM, but you need kernels if your data cannot be separated by a line (for example, if it is separated by a circle).
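As a minimal sketch of the kernel point (assuming scikit-learn is available; the dataset and parameters are just illustrative), a linear SVM fails on circle-separated data while an RBF-kernel SVM handles it:

```python
# Sketch: linear vs. RBF-kernel SVM on data separated by a circle.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings of points: not separable by a line.
X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))  # roughly chance
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))        # close to 1.0
```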
Modifying the SVM for multi-class (there are many ways to do this; here is one):
Let the j-th (of k) training example x_j belong to class i (of N). Then its label is y_j = i.
a) Feature vector: if x_j is a training example belonging to class i (of N), then the feature vector corresponding to x_j is phi(x_j, y_j) = [0 0 ... X ... 0]
Note: X (= x_j) sits in the i-th "position" (block). phi has D*N components in total, where each example has D features; e.g. an onion image has D = 640*480 integer grayscale values.
Note: for any other class p, i.e. y = p, phi(x_j, y) has "X" at position p of the feature vector, and all other entries are zero.
b) Constraints: minimize ||W||^2 (as in the vanilla SVM) subject to:
1) for all labels y except y_1: W.phi(x_1, y_1) >= W.phi(x_1, y) + 1
and 2) for all labels y except y_2: W.phi(x_2, y_2) >= W.phi(x_2, y) + 1
...
and k) for all labels y except y_k: W.phi(x_k, y_k) >= W.phi(x_k, y) + 1
- Note: the intuition here is that W.phi(x_j, y_j) should be larger (by a margin of at least 1) than W.phi(x_j, y) for every other label y.
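A minimal numpy sketch of this construction, with toy sizes (D = 4 features, N = 3 classes) and a hand-made W for illustration (a real W would come from minimizing ||W||^2 subject to the constraints above):

```python
import numpy as np

D, N = 4, 3  # D features per example, N classes (toy sizes, not 640*480)

def phi(x, y):
    """Joint feature vector: x placed in block y of a length-D*N vector, zeros elsewhere."""
    v = np.zeros(D * N)
    v[y * D:(y + 1) * D] = x
    return v

def satisfies_margin_constraints(W, X, labels):
    """Check W.phi(x_j, y_j) >= W.phi(x_j, y) + 1 for every example j and every y != y_j."""
    return all(W @ phi(x, y_true) >= W @ phi(x, y) + 1
               for x, y_true in zip(X, labels)
               for y in range(N) if y != y_true)

def predict(W, x):
    """Classify by the label whose block scores highest: argmax over y of W.phi(x, y)."""
    return int(np.argmax([W @ phi(x, y) for y in range(N)]))

# Toy usage: three one-hot examples, one per class, and a W built by hand.
X = np.eye(N, D)
labels = [0, 1, 2]
W = 2 * sum(phi(x, y) for x, y in zip(X, labels))
print(satisfies_margin_constraints(W, X, labels))  # True
print(predict(W, X[1]))                            # 1
```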
2) Generative. Here we ASSUME (which may seem silly) that each example was generated by a probability distribution specific to its class (for example, one Gaussian for males and one for females, which works well in practice), and we try to learn the parameters of each distribution - the mean and covariance - by computing the mean and covariance of the training examples belonging to that class. Then, for a test example, we see which distribution gives it the highest probability and classify accordingly.
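A minimal sketch of this recipe (assuming scipy is available; the toy data and the equal class priors are my own assumptions): fit one Gaussian per class from that class's training examples, then classify a test point by whichever class-conditional density is highest.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(X, labels):
    """Estimate the (mean, covariance) of the training examples of each class."""
    return {c: (X[labels == c].mean(axis=0), np.cov(X[labels == c], rowvar=False))
            for c in np.unique(labels)}

def classify(x, params):
    """Pick the class whose Gaussian gives x the highest density (equal priors assumed)."""
    return max(params, key=lambda c: multivariate_normal.pdf(x, *params[c]))

# Toy usage: two made-up 2-D classes with different means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(classify([2.8, 3.1], fit_gaussians(X, labels)))  # expected: 1
```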
Neither of these uses yes/no (binary) classifiers.
The discriminative method is better suited for classification in practice, but it cannot model probabilistic outputs. It also requires a large number of training examples for the optimization step (minimizing ||W||^2). There is a technique for combining the two, while avoiding kernels, called Maximum Entropy Discrimination.
To answer another question:
What should I do with a picture that gets high scores from both? Is there a way to get a single mushrooms-vs-onions classifier that somehow knows there is no overlap between these two classes of vegetables?
This is more a problem with the input data than with the learning algorithm itself, which only works on a matrix of numbers. It may reflect noise/uncertainty in the domain (put another way: can humans even tell mushrooms apart from onions reliably?). It might be fixed with a larger/better training dataset. Or perhaps, in the generative case, you have chosen a poor distribution to model the data.
Most people pre-process the raw images before classifying them, in a “feature selection” phase. One selected feature might capture the silhouette of the plant, since mushrooms and onions have different shapes, while the rest of the image may be “noise”. In other domains, such as natural language processing, you might drop prepositions and keep counts of the distinct nouns. But sometimes performance does not improve, because the learning algorithm may not have been looking at all of the features anyway. It really depends on what you are trying to capture; some creativity is involved. There are also algorithms for feature selection.
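As an illustration only (the threshold, the “darker than the background” assumption, and the two shape numbers below are made up for the sketch), reducing an image to a couple of silhouette-based features could look like this:

```python
import numpy as np

def silhouette_features(gray_image, threshold=128):
    """gray_image: 2-D array of integer grayscale values (e.g. 640x480)."""
    mask = gray_image < threshold        # assume the plant is darker than the background
    if not mask.any():                   # nothing below the threshold: no silhouette found
        return np.array([0.0, 0.0])
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    fill_ratio = mask.mean()             # fraction of the image covered by the silhouette
    aspect = (rows[-1] - rows[0] + 1) / (cols[-1] - cols[0] + 1)  # bounding-box height/width
    return np.array([fill_ratio, aspect])
```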
Tony Jebara's courses at Columbia University are a good resource for machine learning.