scikit-learn: desired number of best features (k) not selected

I am trying to select the best features using chi-square (scikit-learn 0.10). From a total of 80 training documents I first extract 227 features, and from these 227 features I want to keep the top 10.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

my_vectorizer = CountVectorizer(analyzer=MyAnalyzer())
X_train = my_vectorizer.fit_transform(train_data)
X_test = my_vectorizer.transform(test_data)
Y_train = np.array(train_labels)
Y_test = np.array(test_labels)
X_train = np.clip(X_train.toarray(), 0, 1)
X_test = np.clip(X_test.toarray(), 0, 1)

ch2 = SelectKBest(chi2, k=10)
print X_train.shape
X_train = ch2.fit_transform(X_train, Y_train)
print X_train.shape

The results are as follows.

(80, 227)
(80, 14)

The behavior is similar if I set k to 100; all 227 features are kept:

(80, 227)
(80, 227)

Why is this happening?

EDIT: a full, untrimmed output example, where I request 30 features and get 32 instead:

Train instances: 9
Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
 [0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train: [0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 32)
Using 32(requested:30) best features from 9 training documents
get support: [ True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True]
get support with vocabulary : [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
  scale_C)
Classifying...

Another untrimmed example, where I request 10 features and get 11 instead:

Train instances: 9
Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
 [0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train: [0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 11)
Using 11(requested:10) best features from 9 training documents
get support: [ True True True False False True False False False False True False False False True False False False True False True False True True False False False False True False False False]
get support with vocabulary : [ 0 1 2 5 10 14 18 20 22 23 28]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
  scale_C)
Classifying...
1 answer

Have you checked what get_support() returns (SelectKBest should have that member function)? It returns the indices that were selected among the top k.
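
For reference, a minimal sketch of that kind of inspection on hypothetical toy data (not the data from the question):

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: 4 documents, 4 binary features, 2 classes.
X = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]])
y = np.array([0, 0, 1, 1])

selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support())               # boolean mask over all features
print(selector.get_support(indices=True))   # indices of the kept features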

My hypothesis is that there are ties caused by the clipping you apply to the data (or by repeated feature vectors, if your feature vectors are categorical and can repeat), and that the scikit-learn function returns all entries that are tied with the top k scores. The additional example where you set k = 100 casts some doubt on this hypothesis, but it's worth a look.
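
To illustrate the tie idea on made-up data (again, not yours): once counts are clipped to {0, 1}, two originally different columns can become identical and then receive exactly the same chi^2 statistic, so a score cutoff at that value either keeps both or drops both.

import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical counts: after clipping to {0, 1}, columns 0 and 1 become identical.
counts = np.array([[2, 1, 0, 0],
                   [3, 1, 0, 1],
                   [0, 0, 1, 0],
                   [0, 0, 2, 1]])
y = np.array([0, 0, 1, 1])

clipped = np.clip(counts, 0, 1)
scores, pvalues = chi2(clipped, y)
print(scores)    # columns 0 and 1 get exactly the same score (a tie)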

Look at what get_support() returns and at what X_train looks like on those indices; check whether the clipping makes many feature columns overlap, creating ties in the chi^2 p-values that SelectKBest uses.
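
One way to check this on your own data (a sketch; it assumes X_train and Y_train are the clipped matrix and labels from the question, taken before fit_transform overwrites X_train, and ch2 is the fitted selector):

import numpy as np
from sklearn.feature_selection import chi2

scores, pvalues = chi2(X_train, Y_train)    # the same statistic SelectKBest ranks on
kth = np.sort(scores)[-10]                  # score of the 10th-best feature
print(np.sum(scores >= kth))                # how many features tie at or above that score
print(ch2.get_support(indices=True))        # indices the selector actually kept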

If this turns out to be the case, you should file a bug/issue with scikit-learn, because at the moment their documentation does not say what SelectKBest will do in the case of ties. Clearly it cannot just take some of the tied indices and not others, but users should at least be warned that ties can lead to an unexpected feature dimensionality after reduction.
