I am trying to classify documents into two classes, using TfidfVectorizer as the feature extraction method.
The input data consists of rows containing about a dozen float metadata fields, a label, and the text block of the document body. To use the body, I applied TfidfVectorizer and got a sparse matrix (which I can inspect by converting it to an array via toarray()). This matrix is usually very large, with tens of thousands of feature dimensions; call it F and say it is 1000 x 15000.
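For context, this is roughly how I build F (the file name and the `body` column are just placeholders for my actual data):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder data: ~15 float metadata columns, a "label" column,
    # and a "body" column with the document text.
    df = pd.read_csv("documents.csv")  # hypothetical file

    vectorizer = TfidfVectorizer()
    F = vectorizer.fit_transform(df["body"])  # sparse matrix, e.g. 1000 x 15000
    print(F.shape)
    # F.toarray() gives the dense version, but it is huge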
To train a scikit-learn classifier, I give it an input matrix X of shape (number of rows x number of features). If I do not use the body, X might be 1000 x 15.
Here is the problem: suppose I horizontally stack F onto X, so X becomes 1000 x 15015. This introduces a couple of problems (a sketch of the naive stacking is shown below): 1) the first 15 features now play a very small role; 2) I run out of memory.
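Concretely, continuing from the snippet above, the naive stacking I tried looks something like this (a sketch; densifying F with toarray() is what exhausts memory, and the 15 metadata columns are drowned out by the 15000 tf-idf columns):

    import numpy as np

    # ~15 float metadata fields, everything except the label and the body text
    meta_cols = [c for c in df.columns if c not in ("label", "body")]
    X_meta = df[meta_cols].to_numpy()  # 1000 x 15

    # Naive horizontal stack: 1000 x 15015
    X = np.hstack([X_meta, F.toarray()])
    y = df["label"].to_numpy()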
The scikit-learn examples show how to feed the TfidfVectorizer output to a classifier on its own, but they do not shed light on how to use it together with metadata.
My question is: how do you use the output of TfidfVectorizer along with the metadata to fit a classifier for training?
Thanks.