The other day I was working on a machine learning task that required extracting several types of feature matrices. I saved these feature matrices as numpy arrays on disk so that I could later use them for evaluation (this was a classification task). In the end, when I wanted to use all the features, I simply concatenated the matrices into one large feature matrix and fed that combined matrix to a classifier.
I am not sure this is the right way to work with a feature matrix in which many different feature types are mixed together, and I wonder what other approaches should be used to combine several kinds of features correctly. Looking at the documentation, however, I found FeatureUnion, which seems to do exactly this.
For example, suppose I would like to build a large feature matrix from three vectorization approaches: TfidfVectorizer, CountVectorizer and HashingVectorizer. This is what I tried after reading the documentation:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn.pipeline import FeatureUnion

df = pd.read_csv('file.csv', header=0, sep=',', names=['id', 'text', 'labels'])

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(2, 2))
bow = CountVectorizer(ngram_range=(2, 2))
hash_vect = HashingVectorizer(ngram_range=(2, 2))

combined_features = FeatureUnion([("tfidf_vect", tfidf_vect),
                                  ("bow", bow),
                                  ("hash", hash_vect)])

X_combined_features = combined_features.fit_transform(df['text'].values)
y = df['labels'].values

print(X_combined_features.toarray())
This prints:
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
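If it helps, a simple way I could think of to check whether the three vectorizers were really concatenated column-wise would be to compare shapes. This is only a minimal sketch reusing df and the vectorizer objects defined above; the X_tfidf / X_bow / X_hash names are just for illustration:

from scipy import sparse

# Fit each vectorizer separately on the same text column
X_tfidf = tfidf_vect.fit_transform(df['text'].values)
X_bow = bow.fit_transform(df['text'].values)
X_hash = hash_vect.fit_transform(df['text'].values)

# If FeatureUnion stacks the columns, the combined width should be
# the sum of the individual widths
print(X_tfidf.shape, X_bow.shape, X_hash.shape)
print(X_combined_features.shape)

# The combined result is a scipy sparse matrix; the zeros printed above
# come from calling .toarray() on a very sparse bigram matrix
print(sparse.issparse(X_combined_features))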
Split the data:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_combined_features, y,
                                                    test_size=0.33)
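For completeness, the evaluation I mentioned at the beginning is just fitting a classifier on this split. A minimal sketch, where LogisticRegression is only a placeholder for whatever estimator is actually used (it accepts the sparse combined matrix directly):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()          # placeholder classifier
clf.fit(X_train, y_train)           # train on the combined features
print(clf.score(X_test, y_test))    # mean accuracy on the held-out split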
So my questions are:
Is this the right way to mix several feature extraction approaches, and is the output above "correct"? In other words, did FeatureUnion actually merge the 3 feature matrices into a single one?
Also, suppose I already have feature matrices computed and stored on disk, for example this one:
A ((152, 33))
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
and these other matrices, also stored as numpy arrays:
B ((152, 10))
[[4210 228 25 ..., 0 0 0]
[4490 180 96 ..., 10 4 6]
[4795 139 8 ..., 0 0 1]
...,
[1475 58 3 ..., 0 0 0]
[4668 256 25 ..., 0 0 0]
[1955 111 10 ..., 0 0 0]]
C ((152, 46))
[[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 17]
[ 0 0 0 ..., 0 0 0]
...,
[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 0]
[ 0 0 0 ..., 0 0 0]]
What would be the correct way to combine A, B and C: numpy.hstack, scipy.sparse.hstack, or FeatureUnion? Is any of them preferable over the others, and if so, why?
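To make the first two options concrete, this is roughly how I would call them. A minimal sketch on made-up toy arrays with the same shapes as above; using FeatureUnion here would additionally require wrapping each precomputed matrix in a custom transformer, which is part of what I am unsure about:

import numpy as np
from scipy import sparse

# Toy stand-ins for A, B and C: same number of rows, different widths
A = np.zeros((152, 33))
B = np.random.randint(0, 5000, size=(152, 10))
C = np.zeros((152, 46), dtype=int)

# Dense column-wise concatenation
dense_combined = np.hstack([A, B, C])
print(dense_combined.shape)        # (152, 89)

# Sparse column-wise concatenation, keeps the result as a sparse matrix
sparse_combined = sparse.hstack([sparse.csr_matrix(A),
                                 sparse.csr_matrix(B),
                                 sparse.csr_matrix(C)])
print(sparse_combined.shape)       # (152, 89)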