SelectKBest based on (estimated) number of features

I am trying to implement a hierarchical text classifier with scikit-learn, with one "root" classifier that sorts each input into one (or more) of roughly 50 categories. For each of these categories I then train a new classifier that solves the actual problem.

The reason for this two-tier approach is training performance and memory constraints (a single classifier that has to separate more than 1k classes does not perform very well...).

This is what my pipeline looks like for each of these "subclassifiers":

pipeline = Pipeline([
    ('vect', CountVectorizer(strip_accents=None, lowercase=True, analyzer='char_wb', ngram_range=(3,8), max_df=0.1)),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('feat', SelectKBest(chi2, k=10000)),
    ('clf', OneVsRestClassifier(SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, n_iter=10))),
])

Now, to my problem: I use SelectKBest to limit the model to a reasonable size, but sometimes there is so little training data for a subclassifier that I don't even reach the 10k feature limit, which causes

(...)
  File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 300, in fit
    self._check_params(X, y)
  File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 405, in _check_params
    % self.k)
ValueError: k should be >=0, <= n_features; got 10000.Use k='all' to return all features.

I don't know how many features I will have before applying CountVectorizer, but I have to define the pipeline in advance. My preferred solution would be to skip the SelectKBest step if there are fewer than k features anyway, but I don't know how to implement this behavior without calling CountVectorizer twice (once up front, once as part of the pipeline).
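A minimal sketch reproducing the failure outside the pipeline (the tiny random matrix stands in for a small subclassifier's vectorized input; chi2 requires non-negative values, hence the `abs`):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# 5 samples with only 20 features -- far fewer than k=10000.
X = np.abs(np.random.RandomState(0).rand(5, 20))
y = [0, 1, 0, 1, 0]

try:
    SelectKBest(chi2, k=10000).fit(X, y)
    err = None
except ValueError as e:
    err = e  # "k should be ... <= n_features; got 10000. ..."

print(err)
```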

Any thoughts on this?

+4
3 answers

You could subclass SelectKBest and relax its parameter check, so that when k exceeds the number of available features it simply keeps all of them, effectively skipping the feature-selection step.

+3

To complete Martin Krämer's answer, here is a subclass of SelectKBest that implements this:

from sklearn.feature_selection import SelectKBest


class SelectAtMostKBest(SelectKBest):

    def _check_params(self, X, y):
        if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
            # Set k to "all" (i.e. skip feature selection)
            # if fewer than k features are available.
            self.k = "all"

Overriding a private method like this is a bit hacky, but it works and leaves the rest of the pipeline definition unchanged...
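A self-contained sketch of the subclass in action (the small random matrix is an illustrative stand-in for the vectorizer's output; with only 20 features and k=10000, the subclass silently falls back to keeping everything instead of raising):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2


class SelectAtMostKBest(SelectKBest):
    """SelectKBest that keeps all features when k > n_features."""

    def _check_params(self, X, y):
        if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
            # Fall back to "all" instead of raising ValueError.
            self.k = "all"


X = np.abs(np.random.RandomState(0).rand(6, 20))  # only 20 features
y = [0, 1, 0, 1, 0, 1]

sel = SelectAtMostKBest(chi2, k=10000).fit(X, y)
Xt = sel.transform(X)
print(Xt.shape)  # all 20 features survive since k > n_features
```

In the pipeline from the question you would simply swap `SelectKBest(chi2, k=10000)` for `SelectAtMostKBest(chi2, k=10000)` in the `'feat'` step.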

+5

Alternatively, you could use SelectPercentile, which selects a fraction of the available features rather than a fixed number, so it can never ask for more features than exist.

0
