In machine learning, what can you do to limit the number of training samples you need?

In many applications, creating a large labeled training set can be very expensive, if not outright impossible. So what steps can be taken to limit the dataset size required for good accuracy?

1 answer

Well, there is a whole subfield of machine learning dedicated to exactly this problem (labeling datasets is expensive): semi-supervised learning.

Honestly, in my experience the computation takes terribly long and the results are underwhelming compared to fully labeled datasets... but training with a large unlabeled dataset is still better than training with nothing!
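
For illustration, here is a minimal sketch of the semi-supervised idea using scikit-learn's LabelPropagation. The dataset, the number of labeled samples, and the model choice are assumptions made for the example, not something from the answer above.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.semi_supervised import LabelPropagation

    # Pretend we could only afford to label 30 of 1000 samples.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    rng = np.random.RandomState(0)
    labeled = rng.choice(len(y), size=30, replace=False)

    y_partial = np.full_like(y, -1)   # -1 marks "unlabeled" for scikit-learn
    y_partial[labeled] = y[labeled]

    model = LabelPropagation()
    model.fit(X, y_partial)           # uses both labeled and unlabeled points
    print("accuracy over all samples:", model.score(X, y))

The propagation step spreads the 30 known labels to the unlabeled points through a similarity graph, which is why it can be slow on large datasets, as noted above.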


Edit: I initially read the question as "labeling the dataset is expensive" rather than "the dataset will be small no matter what".

In that case, here is what I would do:

  • Tune the parameters using leave-one-out cross-validation. It is the most computationally expensive option, but the best one for small datasets (see the sketch after this list).

  • Choose algorithms with a fast convergence rate, i.e. ones that need few samples to learn. (A comparison table would be needed here, which I don't have at hand.)

  • You need very good generalization properties. Combinations of weak classifiers work well here; kNN (k nearest neighbors) is very bad.

  • Change the "generalize" option. Most algorithms consist in a compromise between generalization (regularity) and quality (is the training set a well-classified classifier?). If your data set is small, you should shift the algorithm to generalize (after setting parameters with cross-validation)
