I am trying to classify documents into two classes, using TfidfVectorizer as the feature extraction method.
The input data consists of rows containing about a dozen float metadata fields, a label, and the text block of the document body. To use the body, I applied TfidfVectorizer and got a sparse matrix (which I can inspect by converting it to an array via toarray()). This matrix is usually very large, with tens of thousands of feature dimensions; call it F and say it is 1000 x 15000.
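For context, this is roughly how I build F (the file name and the `body` column are just placeholders for my actual data):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder data: ~15 float metadata columns, a "label" column,
    # and a "body" column with the document text.
    df = pd.read_csv("documents.csv")  # hypothetical file

    vectorizer = TfidfVectorizer()
    F = vectorizer.fit_transform(df["body"])  # sparse matrix, e.g. 1000 x 15000
    print(F.shape)
    # F.toarray() gives the dense version, but it is huge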
To train a scikit-learn classifier, I give it an input matrix X of shape (number of rows x number of features). If I do not use the body, X might be 1000 x 15.
Here is the problem: suppose I horizontally stack F onto X, so X becomes 1000 x 15015. This introduces a couple of problems (a sketch of the naive stacking is shown below): 1) the first 15 features now play a very small role; 2) I run out of memory.
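Concretely, continuing from the snippet above, the naive stacking I tried looks something like this (a sketch; densifying F with toarray() is what exhausts memory, and the 15 metadata columns are drowned out by the 15000 tf-idf columns):

    import numpy as np

    # ~15 float metadata fields, everything except the label and the body text
    meta_cols = [c for c in df.columns if c not in ("label", "body")]
    X_meta = df[meta_cols].to_numpy()  # 1000 x 15

    # Naive horizontal stack: 1000 x 15015
    X = np.hstack([X_meta, F.toarray()])
    y = df["label"].to_numpy()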
The scikit-learn examples show how to feed the TfidfVectorizer output to a classifier on its own, but they do not shed light on how to use it together with metadata.
My question is: how do you use the output of TfidfVectorizer along with the metadata to fit a classifier for training?
Thanks.