There are two broad schools of classification:
1) Discriminative. Here we try to learn the decision boundary from the training examples. Then, depending on which part of the space the test example falls in, as determined by the decision boundary, we assign it a class. The state-of-the-art algorithm here is the SVM, but you need kernels if your data cannot be separated by a line (for example, if it is separated by a circle).
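As a minimal sketch of the kernel point (assuming scikit-learn is available; the dataset and parameters are just illustrative), a linear SVM fails on circle-separated data while an RBF-kernel SVM handles it:

```python
# Sketch: linear vs. RBF-kernel SVM on data separated by a circle.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings of points: not separable by a line.
X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))  # roughly chance
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))        # close to 1.0
```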
Modifying the SVM for multi-class (there are many ways to do this; here is one):
Let the j-th (of k) training example x_j belong to class i (of N). Then its label is y_j = i.
a) Feature vector: if x_j is a training example belonging to class i (of N), then the feature vector corresponding to x_j is phi(x_j, y_j) = [0 0 ... X ... 0]
Note: X (= x_j) sits in the i-th "position" (block). phi has D*N components in total, where each example has D features; e.g. an onion image has D = 640*480 integer grayscale values.
Note: for any other class p, i.e. y = p, phi(x_j, y) has "X" at position p of the feature vector, and all other entries are zero.
b) Constraints: minimize ||W||^2 (as in the vanilla SVM) subject to:
1) for all labels y except y_1: W.phi(x_1, y_1) >= W.phi(x_1, y) + 1
and 2) for all labels y except y_2: W.phi(x_2, y_2) >= W.phi(x_2, y) + 1
...
and k) for all labels y except y_k: W.phi(x_k, y_k) >= W.phi(x_k, y) + 1
- Note: the intuition here is that W.phi(x_j, y_j) should be larger (by a margin of at least 1) than W.phi(x_j, y) for every other label y.
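A minimal numpy sketch of this construction, with toy sizes (D = 4 features, N = 3 classes) and a hand-made W for illustration (a real W would come from minimizing ||W||^2 subject to the constraints above):

```python
import numpy as np

D, N = 4, 3  # D features per example, N classes (toy sizes, not 640*480)

def phi(x, y):
    """Joint feature vector: x placed in block y of a length-D*N vector, zeros elsewhere."""
    v = np.zeros(D * N)
    v[y * D:(y + 1) * D] = x
    return v

def satisfies_margin_constraints(W, X, labels):
    """Check W.phi(x_j, y_j) >= W.phi(x_j, y) + 1 for every example j and every y != y_j."""
    return all(W @ phi(x, y_true) >= W @ phi(x, y) + 1
               for x, y_true in zip(X, labels)
               for y in range(N) if y != y_true)

def predict(W, x):
    """Classify by the label whose block scores highest: argmax over y of W.phi(x, y)."""
    return int(np.argmax([W @ phi(x, y) for y in range(N)]))

# Toy usage: three one-hot examples, one per class, and a W built by hand.
X = np.eye(N, D)
labels = [0, 1, 2]
W = 2 * sum(phi(x, y) for x, y in zip(X, labels))
print(satisfies_margin_constraints(W, X, labels))  # True
print(predict(W, X[1]))                            # 1
```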
2) Generative. Here we ASSUME (which may seem silly) that each example was generated by a probability distribution specific to its class (for example, one Gaussian for males and one for females, which works well in practice), and we try to learn the parameters of each distribution - the mean and covariance - by computing the mean and covariance of the training examples belonging to that class. Then, for a test example, we see which distribution gives it the highest probability and classify accordingly.
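A minimal sketch of this recipe (assuming scipy is available; the toy data and the equal class priors are my own assumptions): fit one Gaussian per class from that class's training examples, then classify a test point by whichever class-conditional density is highest.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(X, labels):
    """Estimate the (mean, covariance) of the training examples of each class."""
    return {c: (X[labels == c].mean(axis=0), np.cov(X[labels == c], rowvar=False))
            for c in np.unique(labels)}

def classify(x, params):
    """Pick the class whose Gaussian gives x the highest density (equal priors assumed)."""
    return max(params, key=lambda c: multivariate_normal.pdf(x, *params[c]))

# Toy usage: two made-up 2-D classes with different means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(classify([2.8, 3.1], fit_gaussians(X, labels)))  # expected: 1
```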
Neither of these uses yes/no (binary) classifiers.
The discriminative method is better suited for classification in practice, but it cannot model probabilistic outputs. It also requires a large number of training examples for the optimization step (minimizing ||W||^2). There is a technique for combining the two, while avoiding kernels, called Maximum Entropy Discrimination.
To answer another question:
What should I do with a picture that gets high scores from both? Is there a way to get a single mushrooms-vs-onions classifier that somehow knows there is no overlap between these two classes of vegetables?
This is more a problem with the input data than with the learning algorithm itself, which only works on a matrix of numbers. It may reflect noise/uncertainty in the domain (put another way: can humans even tell mushrooms apart from onions reliably?). It might be fixed with a larger/better training dataset. Or perhaps, in the generative case, you have chosen a poor distribution to model the data.
Most people pre-process the raw images before classifying them, in a “feature selection” phase. One selected feature might capture the silhouette of the plant, since mushrooms and onions have different shapes, while the rest of the image may be “noise”. In other domains, such as natural language processing, you might drop prepositions and keep counts of the distinct nouns. But sometimes performance does not improve, because the learning algorithm may not have been looking at all of the features anyway. It really depends on what you are trying to capture; some creativity is involved. There are also algorithms for feature selection.
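As an illustration only (the threshold, the “darker than the background” assumption, and the two shape numbers below are made up for the sketch), reducing an image to a couple of silhouette-based features could look like this:

```python
import numpy as np

def silhouette_features(gray_image, threshold=128):
    """gray_image: 2-D array of integer grayscale values (e.g. 640x480)."""
    mask = gray_image < threshold        # assume the plant is darker than the background
    if not mask.any():                   # nothing below the threshold: no silhouette found
        return np.array([0.0, 0.0])
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    fill_ratio = mask.mean()             # fraction of the image covered by the silhouette
    aspect = (rows[-1] - rows[0] + 1) / (cols[-1] - cols[0] + 1)  # bounding-box height/width
    return np.array([fill_ratio, aspect])
```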
Tony Jebara's courses at Columbia University are a good resource for machine learning.