How to classify using TfidfVectorizer plus metadata in practice?

I am trying to classify some documents into two classes, using TfidfVectorizer for feature extraction.

The input data consists of rows, each containing about a dozen float fields, a label, and a text block with the document body. To use the body, I applied TfidfVectorizer and got a sparse matrix (which I can inspect by converting it to an array via toarray()). This matrix is usually very large, on the order of thousands by thousands of dimensions. Call it F, with a size of 1000 x 15000.

To use a classifier in scikit-learn, I give it an input matrix X of shape (number of rows x number of features). If I do not use the body, X is 1000 x 15.

Here is the problem: suppose I horizontally stack F onto X, so that X becomes 1000 x 15015. This raises two issues: 1) the first 15 features will now play a very small role; 2) I run out of memory.

scikit-learn provides an example that uses only TfidfVectorizer input, but it does not shed light on how to use it together with metadata.

My question is: how do you use the output of TfidfVectorizer along with metadata to train a classifier?

Thanks.

+4
3
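  • Extract the text (tf-idf) features; call them X_tfidf.

  • Extract the metadata features; call them X_metadata.

  • Stack them together:

    import scipy.sparse
    X = scipy.sparse.hstack([X_tfidf, X_metadata])
    
  • If that does not work as expected, try re-normalizing: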

    from sklearn.preprocessing import normalize
    X = normalize(X, copy=False)
    

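If you use a linear estimator such as LinearSVC, LogisticRegression or SGDClassifier, you should not need to worry about how large a role each feature plays; that is the estimator's job. A linear estimator learns a weight for every individual feature that reflects how informative it is, so it works this out for you.

(Nonparametric, distance/similarity-based models such as kernel SVMs or k-NN may have a harder time with this kind of data.)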
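The steps above can be sketched end to end as follows. This is a minimal illustration, not a tuned pipeline: the four toy documents, the two metadata columns, and the labels are invented, and the `StandardScaler` step is an added assumption (the answer itself suggests `normalize`) to put the metadata on a scale comparable to the tf-idf values.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only.
texts = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "project status and agenda",
]
metadata = np.array([[0.9, 12.0], [0.1, 3.0], [0.8, 15.0], [0.2, 2.0]])
y = np.array([1, 0, 1, 0])

# Text features: a sparse (n_docs x vocabulary_size) matrix.
X_tfidf = TfidfVectorizer().fit_transform(texts)

# Assumption: standardize the metadata columns, then wrap them in a
# sparse matrix so the stacked result stays sparse.
X_metadata = sp.csr_matrix(StandardScaler().fit_transform(metadata))

# Stack side by side; tocsr() gives a format that supports row slicing.
X = sp.hstack([X_tfidf, X_metadata]).tocsr()

clf = LogisticRegression().fit(X, y)

# One learned weight per column; the last two belong to the metadata.
print(X.shape, clf.coef_.shape)
```

Inspecting `clf.coef_[0, -2:]` afterwards is one way to check how much weight the metadata columns actually received.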

+8

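There is no general recipe for combining tf-idf features with other kinds of data; it depends on your model and your data:

  • Some models cope with features on arbitrary scales and pick out the strongest predictors even when those make up only 1% of the feature vector. Decision trees are one example.
  • Some models let you "weight" features directly, so you can use prior knowledge to balance the metadata against the much larger non-metadata part, for example by scaling it by N_not_meta/N_meta, where N_x is the number of features of type x. SVMs allow this because they are sensitive to feature scale, so simple rescaling has that effect. In probabilistic models such as Naive Bayes, you can likewise make some predictors "stronger" by multiplying their "probability estimates" by a predefined weight.
  • A more advanced approach is an ensemble: one classifier for the metadata, one for tfidf, and a meta-classifier trained on their outputs (a voting scheme over 2 models is of little use),
  • Or simply reduce the dimensionality of the text part with a dimensionality reduction method (for example, PCA)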
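The N_not_meta/N_meta rescaling mentioned in the list can be sketched as below. The matrices are random placeholders, and the ratio is only a heuristic starting point for scale-sensitive models; in practice the weight is something to tune, not a fixed rule.

```python
import numpy as np
import scipy.sparse as sp

# Placeholder data: 6 documents, 50 text features, 5 metadata features.
rng = np.random.default_rng(0)
X_tfidf = sp.random(6, 50, density=0.1, format="csr", random_state=0)
X_metadata = rng.random((6, 5))

# Weight the small metadata block by the ratio of the block widths.
n_not_meta = X_tfidf.shape[1]
n_meta = X_metadata.shape[1]
weight = n_not_meta / n_meta

# Scale the metadata, then stack it next to the text features.
X = sp.hstack([X_tfidf, sp.csr_matrix(X_metadata * weight)]).tocsr()
print(weight, X.shape)
```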

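Which method to choose depends heavily on the problem; as you can see, there are many possibilities, and it is not possible to name a single "best" one.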

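For the memory problem, consider a sparse representation, which scikit-learn supports. It suits text data well, because document feature vectors are mostly zeros.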
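To make the memory point concrete, the sketch below compares the storage of a 1000 x 15015 matrix in CSR form with its dense equivalent. The 1% density is an assumption chosen for illustration; real tf-idf matrices vary.

```python
import scipy.sparse as sp

# Random sparse matrix with the question's shape; 1% non-zeros is assumed.
X = sp.random(1000, 15015, density=0.01, format="csr", random_state=0)

# CSR stores only the non-zero values plus two index arrays.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

# A dense array stores every entry, zero or not.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print(f"sparse: {sparse_bytes / 1e6:.1f} MB, dense: {dense_bytes / 1e6:.1f} MB")
```

Calling toarray() on such a matrix materializes the dense version, which is where the memory usually runs out.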

+3

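One option is to reduce the dimensionality of X_tfidf with a matrix factorization such as sklearn.decomposition.NMF.

It accepts a sparse matrix as input and returns a dense matrix with only a small number of columns (one per component).

To project X_tfidf into a 20-D feature space: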

    from sklearn.decomposition import NMF

    nmf = NMF(n_components=20)
    nmf.fit(data)  # "data": the matrix used to fit the factorization
    X_transformed = nmf.transform(X_tfidf)

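Here "data" stands for whatever matrix you choose to fit the factorization on (for example, the tf-idf matrix of the training documents). The reduced features can then be combined with the metadata: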

    X = scipy.sparse.hstack([X_transformed, X_metadata])

Other projections are possible, such as PCA, but topic models based on matrix factorization, such as NMF or SVD, are common for text classification.
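As a sketch of the SVD variant, the example below uses `TruncatedSVD`, which works directly on sparse input. The six toy documents, the two metadata columns, and the choice of 3 components are all invented for illustration; a real corpus would use far more components.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for illustration only.
texts = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "project status and agenda",
    "limited offer buy pills",
    "monday project meeting notes",
]
X_metadata = np.array(
    [[0.9, 1.0], [0.1, 0.0], [0.8, 1.0], [0.2, 0.0], [0.7, 1.0], [0.3, 0.0]]
)

X_tfidf = TfidfVectorizer().fit_transform(texts)

# Project the sparse tf-idf matrix onto 3 dense components.
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X_tfidf)

# Both blocks are dense and small now, so a plain NumPy stack is enough.
X = np.hstack([X_reduced, X_metadata])
print(X.shape)
```

After the reduction the metadata columns are no longer outnumbered thousands to one, which addresses the first concern in the question as well as the memory one.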

0