Well, there is a whole branch of machine learning specifically designed to solve this problem (labeling datasets is expensive): semi-supervised learning.
Honestly, in my experience, training is painfully slow and the results are weak compared to fully labeled datasets... but it's better to train on a large, mostly unlabeled dataset than on nothing!
Edit: I initially read the question as "labeling the dataset is expensive" rather than "the dataset will be small, no matter what".
In that case, I would:
Tune my parameters using leave-one-out cross-validation. It is computationally expensive, but it makes the best use of a small dataset.
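For concreteness, a sketch of leave-one-out cross-validation with scikit-learn; the wine dataset and the scaled logistic-regression model are just placeholders I picked for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Leave-one-out: one fold per sample, so n model fits instead of k.
# Expensive, but no labeled data is wasted on a large held-out set.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
mean_accuracy = scores.mean()  # one 0/1 score per held-out sample
```

Note that `scores` has one entry per sample, which is why the cost grows linearly with dataset size; for a genuinely small dataset that cost is usually acceptable.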
Choose algorithms with fairly fast convergence. (You would need a comparison table, which I don't have at hand right now.)
Favor methods with very good generalization properties, such as ensembles of weak classifiers (boosting). kNN (k-nearest neighbors) generalizes very poorly on small datasets.
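A sketch of the "combine weak classifiers" idea using AdaBoost, which boosts shallow decision stumps by default; the breast-cancer dataset and the 100-estimator setting are my own example choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# AdaBoost combines many weak learners (depth-1 decision stumps by
# default) into one strong classifier; the ensemble usually
# generalizes far better than any single stump would.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(boosted, X, y, cv=5)
```

Each individual stump barely beats chance, but reweighting the data between rounds lets the ensemble as a whole reach strong cross-validated accuracy.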
Tune the regularization. Most algorithms involve a trade-off between generalization (regularization) and fit (how well the training set is classified). If your dataset is small, you should shift the algorithm toward stronger generalization (after setting the parameters with cross-validation).
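To make the last point concrete, here is one way to tune that trade-off with scikit-learn's `GridSearchCV`; the dataset, the model, and the candidate `C` values are illustrative assumptions on my part.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# In LogisticRegression, C is the INVERSE regularization strength:
# smaller C pushes the model toward "generalize", larger C toward
# fitting the training set more closely.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(
    pipe,
    {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)
best_C = grid.best_params_["logisticregression__C"]
```

On a small dataset you would expect cross-validation to favor the smaller `C` values (stronger regularization), which is exactly the shift toward generalization described above.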