scikit-learn: use a custom vocabulary with Pipeline

In my scikit-learn Pipeline, I would like to pass a custom vocabulary to CountVectorizer():

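from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

text_classifier = Pipeline([
    ('count', CountVectorizer(vocabulary=myvocab)),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(C=1000))
])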

However, as I understand it, when I call

text_classifier.fit(X_train, y_train)

Pipeline uses the fit_transform() method of CountVectorizer(), which ignores myvocab. How can I change my pipeline so that it uses myvocab? Thanks!

+5
1 answer

This was a bug in scikit-learn that I fixed five minutes ago. Thanks for spotting it. I suggest you either upgrade to the newest version from GitHub, or separate the vectorizer from the pipeline as a workaround:

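from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# The vocabulary is fixed up front, so transform() needs no prior fit().
count = CountVectorizer(vocabulary=myvocab)
X_vectorized = count.transform(X_train)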

text_classifier = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(C=1000))
])

text_classifier.fit(X_vectorized, y_train)
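
To check that the standalone vectorizer really honors the vocabulary, one can inspect the matrix it produces. This is only a minimal sketch: the vocabulary and the two toy documents below are hypothetical and not taken from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vocabulary and documents, for illustration only.
myvocab = ["apple", "banana", "cherry"]
docs = ["apple banana apple", "cherry durian"]

count = CountVectorizer(vocabulary=myvocab)
# With a fixed vocabulary, transform() can be called without fit().
X_vectorized = count.transform(docs)

# One column per vocabulary entry, in the order given by myvocab.
# Words outside myvocab (here "durian") are not counted.
print(X_vectorized.shape)
print(X_vectorized.toarray())
```

If the matrix has exactly as many columns as myvocab has entries, the vocabulary is being used.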

UPDATE. The fix has since been included in released versions of scikit-learn.
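
With a scikit-learn version that contains the fix, the pipeline from the question should work as written. A small sketch to confirm this on your own installation, assuming a hypothetical myvocab and made-up training data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical vocabulary and training data, for illustration only.
myvocab = ["good", "bad"]
X_train = ["good good movie", "bad movie", "good film", "bad bad film"]
y_train = [1, 0, 1, 0]

text_classifier = Pipeline([
    ('count', CountVectorizer(vocabulary=myvocab)),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(C=1000)),
])
text_classifier.fit(X_train, y_train)

# After fitting, the vectorizer step should expose exactly the custom
# vocabulary, not one learned from X_train ("movie", "film" are absent).
fitted_vocab = text_classifier.named_steps['count'].vocabulary_
print(fitted_vocab)
```

If the printed mapping contains words that are not in myvocab, the installed version still has the bug and the workaround above applies.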

+9
