What is the correct way to mix sparse matrices with sklearn?

The other day I was working on a machine learning task that required extracting several types of feature matrices. I saved these feature matrices to disk as numpy arrays so that I could later use them with an estimator (this was a classification task). In the end, when I wanted to use all the features, I simply concatenated the matrices into one large feature matrix and fed that matrix to the estimator.

I do not know whether this works correctly when the feature matrices are sparse (contain many zeros). What approaches should be used to mix several types of features correctly? Looking through the documentation, I found FeatureUnion, which seems to do exactly this task.

For example, suppose I would like to build one large feature matrix from three vectorization approaches: TfidfVectorizer, CountVectorizer and HashingVectorizer. This is what I tried after reading the documentation:

#Read the .csv file
import pandas as pd
df = pd.read_csv('file.csv',
                     header=0, sep=',', names=['id', 'text', 'labels'])

#vectorizer 1
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(2,2))
#vectorizer 2
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(ngram_range=(2,2))

#vectorizer 3
from sklearn.feature_extraction.text import HashingVectorizer
hash_vect = HashingVectorizer(ngram_range=(2,2))


#Combine the above vectorizers in one single feature matrix:

from sklearn.pipeline import  FeatureUnion
combined_features = FeatureUnion([("tfidf_vect", tfidf_vect),
                                  ("bow", bow),
                                  ("hash",hash_vect)])

X_combined_features = combined_features.fit_transform(df['text'].values)
y = df['labels'].values

#Check the matrix
print(X_combined_features.toarray())

Then:

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
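Since the combined matrix is sparse, a dense printout shows mostly zeros; the shape and the number of stored non-zeros are more informative. A small self-contained sketch (a random sparse matrix stands in for X_combined_features, which comes from file.csv in the original code):

```python
from scipy.sparse import random as sparse_random

# toy sparse matrix standing in for X_combined_features
X = sparse_random(152, 89, density=0.05, format='csr', random_state=0)

print(X.shape)   # number of rows and combined columns
print(X.nnz)     # number of stored non-zero entries
print(X.format)  # 'csr'
```

Inspecting `shape` and `nnz` avoids materializing the full dense array, which `toarray()` does.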

Split the data:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_combined_features, y, test_size=0.33)

My questions are: is this the correct way to do this? And if I also have my own "custom" feature matrices, can I still use FeatureUnion together with the 3 vectorizers above?

For example, suppose I have the following feature matrix:

A ((152, 33))

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

and, extracted with my own functions and saved as numpy arrays, these two:

B ((152, 10))

[[4210  228   25 ...,    0    0    0]
 [4490  180   96 ...,   10    4    6]
 [4795  139    8 ...,    0    0    1]
 ..., 
 [1475   58    3 ...,    0    0    0]
 [4668  256   25 ...,    0    0    0]
 [1955  111   10 ...,    0    0    0]]

C ((152, 46))

[[ 0  0  0 ...,  0  0  0]
 [ 0  0  0 ...,  0  0 17]
 [ 0  0  0 ...,  0  0  0]
 ..., 
 [ 0  0  0 ...,  0  0  0]
 [ 0  0  0 ...,  0  0  0]
 [ 0  0  0 ...,  0  0  0]]

Can I combine A, B and C with numpy.hstack, scipy.sparse.hstack or FeatureUnion? Which of these is the correct approach, and will sklearn handle the resulting matrix correctly?


Yes, using FeatureUnion is a correct way to do this: it runs each transformer on the input and concatenates their outputs horizontally into one feature matrix.

As for your "custom" features: as long as your custom extractor implements fit and transform, you can add it to the FeatureUnion as well.

For example, extending the FeatureUnion above:

custom_vect = YourCustomVectorizer()
combined_features = FeatureUnion([("tfidf_vect", tfidf_vect),
                                  ("bow", bow),
                                  ("hash", hash_vect),
                                  ("custom", custom_vect])

Alternatively, if your feature matrices are already computed (for example, loaded from disk as numpy arrays), you can stack them directly: numpy.hstack for dense arrays, or scipy.sparse.hstack for sparse matrices; both give you the same combined matrix.
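For the already-extracted matrices A, B and C, stacking could look like this (the toy arrays below only stand in for the real matrices; what matters is that they share the row count):

```python
import numpy as np
from scipy import sparse

# stand-ins for the precomputed feature matrices (same number of rows)
A = np.zeros((152, 33))
B = np.random.RandomState(0).randint(0, 5000, size=(152, 10)).astype(float)
C = np.zeros((152, 46))

# dense stacking: columns are concatenated, rows aligned
X_dense = np.hstack([A, B, C])
print(X_dense.shape)  # (152, 89)

# sparse stacking: convert first, result stays sparse (memory-friendly)
X_sparse = sparse.hstack([sparse.csr_matrix(A),
                          sparse.csr_matrix(B),
                          sparse.csr_matrix(C)]).tocsr()
print(X_sparse.shape)  # (152, 89)
```

sklearn estimators accept scipy sparse matrices directly, so there is no need to densify the stacked result.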

The advantage of FeatureUnion is that it fits into a Pipeline and accepts an n_jobs parameter, so the individual transformers can be run in parallel.
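That advantage can be sketched like this (the classifier choice and the toy data are illustrative, not from the original post):

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer()),
        ("bow", CountVectorizer()),
    ], n_jobs=1)),  # n_jobs=-1 would run the vectorizers in parallel
    ("clf", LogisticRegression()),
])

texts = ["good movie", "bad movie", "great film", "terrible film"]
labels = [1, 0, 1, 0]

# one fit call trains the vectorizers and the classifier together
pipe.fit(texts, labels)
preds = pipe.predict(texts)
```

With the whole feature extraction inside the Pipeline, the same object can also go straight into cross-validation or grid search.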


Either way, you end up with one combined feature matrix that can be passed to any sklearn estimator.

