I am using Pipeline from sklearn to classify text.
In this example Pipeline, I have a TfidfVectorizer and some custom features combined with FeatureUnion, followed by a classifier, as the Pipeline steps. I then fit on the training data and make a prediction:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
features = []
measure_features = MeasureFeatures()
features.append(('measure_features', measure_features))
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=4000)
features.append(('ngram', countVecWord))
all_features = FeatureUnion(features)
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
pipeline = Pipeline([
    ('all', all_features),
    ('clf', LinearSVC1),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
The above code works fine, but there is a twist. I want to run part-of-speech (POS) tagging on the text and use a separate vectorizer on the tagged text.
X = ['I am a sentence', 'an example']
X_tagged = do_tagging(X)
Y = [1, 2]
X_dev = ['another sentence']
X_dev_tagged = do_tagging(X_dev)
features = []
measure_features = MeasureFeatures()
features.append(('measure_features', measure_features))
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=4000)
countVecPOS = TfidfVectorizer(ngram_range=(1, 4), max_features=2000)
features.append(('ngram', countVecWord))
features.append(('pos_ngram', countVecPOS))
all_features = FeatureUnion(features)
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
pipeline = Pipeline([
    ('all', all_features),
    ('clf', LinearSVC1),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
How do I fit this kind of data? How can the two vectorizers tell the raw text apart from the POS-tagged text? What are my options?
I also have custom features. Some of them would use the raw text, and others would use the POS text.
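One option I have been considering, as a rough sketch rather than a confirmed solution: pass both versions of the text in a dict and let a small selector transformer hand the right entry to each vectorizer. ItemSelector is a hypothetical helper written for illustration, not part of sklearn, and the POS strings are hand-written stand-ins for the output of do_tagging().

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


class ItemSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: return one entry from a dict of parallel lists."""

    def __init__(self, key):
        self.key = key

    def fit(self, data, y=None):
        return self  # nothing to learn

    def transform(self, data):
        return data[self.key]


# Assumption: hand-written tags stand in for do_tagging() output.
train = {'text': ['I am a sentence', 'an example'],
         'pos': ['PRP VBP DT NN', 'DT NN']}
dev = {'text': ['another sentence'], 'pos': ['DT NN']}
Y = [1, 2]

all_features = FeatureUnion([
    # Word n-grams are computed from the raw text only.
    ('ngram', Pipeline([
        ('select', ItemSelector('text')),
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
    ])),
    # POS n-grams are computed from the tagged text only.
    ('pos_ngram', Pipeline([
        ('select', ItemSelector('pos')),
        ('tfidf', TfidfVectorizer(ngram_range=(1, 4), max_features=2000)),
    ])),
])

pipeline = Pipeline([
    ('all', all_features),
    ('clf', LinearSVC(tol=1e-4, C=0.1)),
])
pipeline.fit(train, Y)
y_pred = pipeline.predict(dev)
```

If this is a reasonable direction, a custom transformer such as MeasureFeatures could presumably be wrapped the same way, with a selector in front of it that picks either 'text' or 'pos'.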
EDIT: Added MeasureFeatures()
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
import numpy as np

class MeasureFeatures(BaseEstimator):
    def __init__(self):
        pass

    def get_feature_names(self):
        return np.array(['type_token', 'count_nouns'])

    def fit(self, documents, y=None):
        return self

    def transform(self, x_dataset):
        X_type_token = list()
        X_count_nouns = list()
        for sentence in x_dataset:
            X_type_token.append(type_token_ratio(sentence))
            X_count_nouns.append(count_nouns(sentence))
        # One row per sentence, one column per measure.
        X = np.array([X_type_token, X_count_nouns]).T
        print(X)
        print(X.shape)
        # The scaler is fitted on the first call to transform() and reused afterwards.
        if not hasattr(self, 'scalar'):
            self.scalar = StandardScaler().fit(X)
        return self.scalar.transform(X)
(count_nouns() and type_token_ratio() are helper functions not shown here.)
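To make the raw-text versus POS-text split concrete, here are minimal stand-ins for the two helpers. These are illustrative placeholders, not the actual implementations: the type/token ratio is assumed to be computed on whitespace-split, lowercased raw text, and count_nouns is assumed to receive a space-separated string of Penn Treebank style tags.

```python
def type_token_ratio(sentence):
    """Unique tokens divided by total tokens; 0.0 for empty input.

    Assumption: works on the raw text, split on whitespace and lowercased.
    """
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return float(len(set(tokens))) / len(tokens)


def count_nouns(tagged_sentence):
    """Count noun tags in a space-separated tag string.

    Assumption: Penn Treebank style tags, where noun tags start with 'NN'.
    """
    return sum(1 for tag in tagged_sentence.split() if tag.startswith('NN'))
```

With stand-ins like these, type_token_ratio needs the raw sentence while count_nouns needs the tagged one, which is exactly why a single transform(x_dataset) call on one input does not work for me.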