Scikits-learn: use a custom vocabulary with Pipeline

Question

In my scikits-learn Pipeline, I would like to pass a custom vocabulary to CountVectorizer():

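# Import paths as in current scikit-learn releases
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

text_classifier = Pipeline([
    ('count', CountVectorizer(vocabulary=myvocab)),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(C=1000))
])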

However, as I understand it, when I call

text_classifier.fit(X_train, y_train)

Pipeline uses the fit_transform() method of CountVectorizer, which ignores myvocab. How can I change my pipeline to use myvocab? Thanks!

+5

python scikit-learn machine-learning scikits

mathias Jul 07 '11 at 9:08

1 answer

Fred Foo · Accepted Answer · 2011-07-08T23:19:05+0000

This was a bug in scikit-learn that I fixed five minutes ago. Thanks for spotting it. I suggest you either upgrade to the newest version from GitHub, or, as a workaround, take the vectorizer out of the pipeline:

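# Vectorize outside the pipeline; transform() uses the fixed vocabulary
count = CountVectorizer(vocabulary=myvocab)
X_vectorized = count.transform(X_train)

# The pipeline now starts from the count matrix
text_classifier = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(C=1000))
])

text_classifier.fit(X_vectorized, y_train)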
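With this workaround the vectorizer sits outside the pipeline, so new documents also have to go through the same `count` object before prediction. A minimal sketch of that step follows. The toy `myvocab`, `X_train`, `y_train` and `X_test` are hypothetical, and the import paths are those of current scikit-learn releases, not necessarily the ones available when this answer was written:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data and vocabulary, for illustration only.
myvocab = ["good", "bad", "movie"]
X_train = ["good movie", "bad movie", "good good film", "bad plot"]
y_train = [1, 0, 1, 0]
X_test = ["good plot", "bad film"]

# Vectorizer kept outside the pipeline, as in the workaround.
count = CountVectorizer(vocabulary=myvocab)

text_classifier = Pipeline([
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC(C=1000)),
])
text_classifier.fit(count.transform(X_train), y_train)

# Reuse the same vectorizer so the columns match the training matrix.
X_test_vectorized = count.transform(X_test)
predictions = text_classifier.predict(X_test_vectorized)
```

Words outside `myvocab` (here "film" and "plot") are simply dropped, so every matrix has one column per vocabulary entry.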

UPDATE: since this answer was posted, the fix has been included in released versions of scikit-learn.
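One way to check whether an installed scikit-learn honours a fixed vocabulary inside a Pipeline is to fit the original pipeline on a few documents and inspect the vectorizer afterwards. This is only a sketch: the data and `myvocab` below are made up, and it assumes a current release where the fitted vocabulary is exposed as `vocabulary_`:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data and vocabulary, for illustration only.
myvocab = ["good", "bad", "movie"]
X_train = ["good movie", "bad movie", "good good film", "bad plot"]
y_train = [1, 0, 1, 0]

text_classifier = Pipeline([
    ("count", CountVectorizer(vocabulary=myvocab)),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC(C=1000)),
])
text_classifier.fit(X_train, y_train)

# If the vocabulary was respected, only the supplied terms appear here;
# "film" and "plot" from the training text must not have been added.
fitted_vocab = text_classifier.named_steps["count"].vocabulary_
vocab_respected = set(fitted_vocab) == set(myvocab)
```

If `vocab_respected` is false, the vectorizer learned its own vocabulary from the training text, and the workaround of vectorizing outside the pipeline still applies.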
