Add stemming support to CountVectorizer (sklearn)

I am trying to add stemming to my NLP pipeline in sklearn.

    from nltk.corpus import stopwords
    from nltk.stem.snowball import FrenchStemmer
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    stop = stopwords.words('french')
    stemmer = FrenchStemmer()

    class StemmedCountVectorizer(CountVectorizer):
        def __init__(self, stemmer):
            super(StemmedCountVectorizer, self).__init__()
            self.stemmer = stemmer

        def build_analyzer(self):
            analyzer = super(StemmedCountVectorizer, self).build_analyzer()
            # Stem each token produced by the default analyzer
            return lambda doc: (self.stemmer.stem(w) for w in analyzer(doc))

    stem_vectorizer = StemmedCountVectorizer(stemmer)
    text_clf = Pipeline([('vect', stem_vectorizer),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SVC(kernel='linear', C=1))])

When I use this pipeline with sklearn's plain CountVectorizer, it works. And if I create the features manually like this, it also works:

    vectorizer = StemmedCountVectorizer(stemmer)
    X_counts = vectorizer.fit_transform(X)
    tfidf_transformer = TfidfTransformer()
    X_tfidf = tfidf_transformer.fit_transform(X_counts)

EDIT

If I run this pipeline in my IPython Notebook, it displays [*] and nothing happens. When I look at my terminal, it gives this error:

    Process PoolWorker-12:
    Traceback (most recent call last):
      File "C:\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
        self.run()
      File "C:\Anaconda2\lib\multiprocessing\process.py", line 114, in run
        self._target(*self._args, **self._kwargs)
      File "C:\Anaconda2\lib\multiprocessing\pool.py", line 102, in worker
        task = get()
      File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\pool.py", line 360, in get
        return recv()
    AttributeError: 'module' object has no attribute 'StemmedCountVectorizer'

Example

Here is a complete example:

    from sklearn.pipeline import Pipeline
    from sklearn import grid_search
    from sklearn.svm import SVC
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from nltk.stem.snowball import FrenchStemmer

    stemmer = FrenchStemmer()
    analyzer = CountVectorizer().build_analyzer()

    def stemming(doc):
        return (stemmer.stem(w) for w in analyzer(doc))

    X = ['le chat est beau', 'le ciel est nuageux', 'les gens sont gentils',
         'Paris est magique', 'Marseille est tragique', 'JCVD est fou']
    Y = [1, 0, 1, 1, 0, 0]

    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SVC())])
    parameters = {'vect__analyzer': ['word', stemming]}
    gs_clf = grid_search.GridSearchCV(text_clf, parameters, n_jobs=-1)
    gs_clf.fit(X, Y)

If I remove the stemming function from the parameters, it works; otherwise, it does not.

UPDATE

The problem lies in the parallelization: when I remove n_jobs=-1, the problem disappears. Presumably the worker processes cannot unpickle the custom stemming function defined inside the notebook, which would explain the AttributeError in the traceback.
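A common workaround for this kind of pickling failure (a sketch, not from the original post; the module name stem_utils.py is hypothetical) is to move the stemming function into its own importable module, so the worker processes can pickle it by reference:

    # stem_utils.py -- hypothetical helper module
    from nltk.stem.snowball import FrenchStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    stemmer = FrenchStemmer()
    analyzer = CountVectorizer().build_analyzer()

    def stemming(doc):
        # A module-level function is pickled by reference, so the
        # n_jobs=-1 worker processes can import and find it
        return [stemmer.stem(w) for w in analyzer(doc)]

In the notebook, importing the function with from stem_utils import stemming and passing it in parameters as before should then let the grid search run with n_jobs=-1.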

+8
python scikit-learn nlp
3 answers

You can pass a callable as the analyzer argument to the CountVectorizer constructor to provide a custom analyzer. It seems to work for me.

    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.stem.snowball import FrenchStemmer

    stemmer = FrenchStemmer()
    analyzer = CountVectorizer().build_analyzer()

    def stemmed_words(doc):
        return (stemmer.stem(w) for w in analyzer(doc))

    stem_vectorizer = CountVectorizer(analyzer=stemmed_words)
    print(stem_vectorizer.fit_transform(['Tu marches dans la rue']))
    print(stem_vectorizer.get_feature_names())

Prints out:

      (0, 4)    1
      (0, 2)    1
      (0, 0)    1
      (0, 1)    1
      (0, 3)    1
    [u'dan', u'la', u'march', u'ru', u'tu']
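Since stemmed_words is a plain function, the same callable can be dropped into the pipeline from the question; a minimal sketch, reusing CountVectorizer and stemmed_words from the snippet above:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.svm import SVC

    # Reuses stemmed_words defined above as the custom analyzer
    text_clf = Pipeline([('vect', CountVectorizer(analyzer=stemmed_words)),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SVC(kernel='linear', C=1))])

Note that for n_jobs=-1 the function still has to live in an importable module rather than the notebook itself.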
+14

I know it's a little late for me to post my answer, but here it is, in case someone still needs help.

Below is the cleanest approach to adding a language stemmer to CountVectorizer, by overriding build_analyzer():

    from sklearn.feature_extraction.text import CountVectorizer
    import nltk.stem
    from nltk.corpus import stopwords

    french_stemmer = nltk.stem.SnowballStemmer('french')

    class StemmedCountVectorizer(CountVectorizer):
        def build_analyzer(self):
            analyzer = super(StemmedCountVectorizer, self).build_analyzer()
            return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]

    # CountVectorizer only ships an English stop-word list, so pass the
    # French stop words from NLTK as an explicit list
    vectorizer_s = StemmedCountVectorizer(min_df=3, analyzer="word",
                                          stop_words=stopwords.words('french'))

You can freely call the fit and transform methods of the CountVectorizer class on your vectorizer_s object.
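For example, a minimal usage sketch (reusing sentences from the question's example; the default min_df is used here so the tiny corpus still produces a vocabulary):

    docs = ['le chat est beau', 'le ciel est nuageux', 'les gens sont gentils']
    v = StemmedCountVectorizer(analyzer="word",
                               stop_words=stopwords.words('french'))
    X_counts = v.fit_transform(docs)   # fit() / transform() work the same way
    print(v.get_feature_names())       # the stemmed vocabulary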

+7

You can try:

    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        # Use the module-level stemmer instead of an instance attribute
        return lambda doc: (stemmer.stem(w) for w in analyzer(doc))

and remove the __init__ method, taking the stemmer from the enclosing scope instead.
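Put together, the whole class would look something like this (a sketch, with the stemmer taken from module scope as in the question):

    from nltk.stem.snowball import FrenchStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    stemmer = FrenchStemmer()

    class StemmedCountVectorizer(CountVectorizer):
        # No custom __init__, so CountVectorizer's constructor and all of
        # its parameters (min_df, stop_words, ...) remain usable
        def build_analyzer(self):
            analyzer = super(StemmedCountVectorizer, self).build_analyzer()
            return lambda doc: (stemmer.stem(w) for w in analyzer(doc))

    stem_vectorizer = StemmedCountVectorizer()
    print(stem_vectorizer.fit_transform(['Tu marches dans la rue']))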

+1
