I am trying to implement a hierarchical text classifier with scikit-learn: one “root” classifier that sorts every input line into one (or more) of 50 categories. For each of these categories I then train a new classifier that solves the actual problem.
The reason for this two-tier approach is training performance and memory issues: a single classifier that has to separate more than 1k classes does not work very well.
This is what my pipeline looks like for each of these “sub-classifiers”:
pipeline = Pipeline([
    ('vect', CountVectorizer(strip_accents=None, lowercase=True,
                             analyzer='char_wb', ngram_range=(3, 8),
                             max_df=0.1)),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('feat', SelectKBest(chi2, k=10000)),
    ('clf', OneVsRestClassifier(SGDClassifier(loss='log', penalty='elasticnet',
                                              alpha=0.0001, n_iter=10))),
])
Now to my problem: I use SelectKBest to limit the model to a reasonable size, but for some sub-classifiers there is not enough input data, so they never even reach the 10k feature limit, which causes
(...)
File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 300, in fit
self._check_params(X, y)
File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 405, in _check_params
% self.k)
ValueError: k should be >=0, <= n_features; got 10000.Use k='all' to return all features.
I don’t know in advance how many features I will have after applying CountVectorizer, but I have to define the pipeline up front. My preferred solution would be to skip the SelectKBest step whenever there are fewer than k features anyway, but I don’t know how to implement this behavior without running CountVectorizer twice (once up front, once as part of the pipeline).
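To make the desired behavior concrete, here is an untested sketch of what I have in mind: a small SelectKBest subclass (the name SelectAtMostKBest is my own invention) that falls back to keeping all features instead of raising the ValueError above when fewer than k features are available.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2


class SelectAtMostKBest(SelectKBest):
    """Like SelectKBest, but if the data has fewer than k features,
    keep them all instead of raising a ValueError (hypothetical name)."""

    def _check_params(self, X, y):
        if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
            # Fewer than k features available: fall back to keeping all
            self.k = "all"


# Plain SelectKBest(chi2, k=10000) would raise on this 5-feature input;
# the subclass silently keeps all 5 features instead.
X = np.abs(np.random.RandomState(0).randn(20, 5))  # chi2 needs non-negative X
y = np.array([0, 1] * 10)
Xt = SelectAtMostKBest(chi2, k=10000).fit_transform(X, y)
print(Xt.shape)  # (20, 5)
```

This would slot into the pipeline in place of the 'feat' step, but I am not sure whether overriding the private _check_params method is a supported extension point or likely to break across scikit-learn versions.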
Any thoughts on this?