I am trying to select the best features using a chi-square test (scikit-learn 0.10). From a total of 80 training documents I first extract 227 features, and from those 227 features I want to keep only the top 10.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

my_vectorizer = CountVectorizer(analyzer=MyAnalyzer())
X_train = my_vectorizer.fit_transform(train_data)
X_test = my_vectorizer.transform(test_data)
Y_train = np.array(train_labels)
Y_test = np.array(test_labels)

# clip the counts to 0/1 so the features are binary presence indicators
X_train = np.clip(X_train.toarray(), 0, 1)
X_test = np.clip(X_test.toarray(), 0, 1)

ch2 = SelectKBest(chi2, k=10)
print X_train.shape
X_train = ch2.fit_transform(X_train, Y_train)
print X_train.shape
The results are as follows.
(80, 227)
(80, 14)
The behaviour is similar if I set k to 100:
(80, 227)
(80, 227)
Why is this happening?
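In case it helps, this is a minimal sketch of how I can inspect the raw chi2 scores around the cut-off (assuming the variables from the snippet above, with X_bin being a copy of the binary matrix taken before fit_transform overwrites X_train; whether tied scores are actually the cause here is only my guess):

import numpy as np
from sklearn.feature_selection import chi2

scores, pvalues = chi2(X_bin, Y_train)   # one chi2 score per original feature
kth = np.sort(scores)[::-1][9]           # value of the 10th-highest score
print np.sort(scores)[::-1][:15]         # highest scores, to see repeated values near the cut-off
print np.sum(scores >= kth)              # how many features score at least as high as the 10th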
EDIT: a full, untrimmed output example, where I request 30 features and get 32 instead:
Train instances: 9  Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
 [0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 32)
Using 32(requested:30) best features from 9 training documents
get support:
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True]
get support with vocabulary :
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
  scale_C)
Classifying...
Another untrimmed example, where I request 10 features and get 11 instead:
Train instances: 9  Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
 [0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 11)
Using 11(requested:10) best features from 9 training documents
get support:
[ True  True  True False False  True False False False False  True False
 False False  True False False False  True False  True False  True  True
 False False False False  True False False False]
get support with vocabulary :
[ 0  1  2  5 10 14 18 20 22 23 28]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
  scale_C)
Classifying...
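For completeness, this is the workaround I am considering in the meantime (my own sketch, not necessarily the intended API usage): rank the features myself by their chi2 scores and keep exactly the 10 highest-scoring columns. Here X_bin and X_test_bin stand for the binary train/test matrices from the first snippet, taken before any selection.

import numpy as np
from sklearn.feature_selection import chi2

scores, pvalues = chi2(X_bin, Y_train)
top10 = np.argsort(scores)[-10:]   # indices of the 10 highest-scoring features
X_train_sel = X_bin[:, top10]
X_test_sel = X_test_bin[:, top10]
print X_train_sel.shape            # always (n_documents, 10)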