In machine learning, what can you do to limit the number of training samples you need?

In many applications, creating a large labeled training set can be very expensive, if not outright impossible. So what steps can be taken to limit the dataset size required for good accuracy?

1 answer

Well, there is a whole subfield of machine learning dedicated to exactly this problem (labeling datasets is expensive): semi-supervised learning.

Honestly, in my experience the computation takes terribly long and the results are underwhelming compared to fully labeled datasets... but training with a large unlabeled dataset is still better than training with nothing!
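
For illustration, here is a minimal sketch of the semi-supervised idea using scikit-learn's LabelPropagation. The dataset, the number of labeled samples, and the model choice are assumptions made for the example, not something from the answer above.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.semi_supervised import LabelPropagation

    # Pretend we could only afford to label 30 of 1000 samples.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    rng = np.random.RandomState(0)
    labeled = rng.choice(len(y), size=30, replace=False)

    y_partial = np.full_like(y, -1)   # -1 marks "unlabeled" for scikit-learn
    y_partial[labeled] = y[labeled]

    model = LabelPropagation()
    model.fit(X, y_partial)           # uses both labeled and unlabeled points
    print("accuracy over all samples:", model.score(X, y))

The propagation step spreads the 30 known labels to the unlabeled points through a similarity graph, which is why it can be slow on large datasets, as noted above.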


Edit: I initially read the question as "labeling the dataset is expensive" rather than "the dataset will be small no matter what".

In that case, here is what I would do:

  • Tune the parameters using leave-one-out cross-validation. It is the most computationally expensive option, but the best one for small datasets (see the sketch after this list).

  • Choose algorithms with a fast convergence rate, i.e. ones that need few samples to learn. (A comparison table would be needed here, which I don't have at hand.)

  • You need very good generalization properties. Combinations of weak classifiers work well here; kNN (k nearest neighbors) is very bad.

  • Change the "generalize" option. Most algorithms consist in a compromise between generalization (regularity) and quality (is the training set a well-classified classifier?). If your data set is small, you should shift the algorithm to generalize (after setting parameters with cross-validation)
