I am trying to use scikit-learn's CountVectorizer to count character 2-grams, ignoring spaces. The docs describe the analyzer parameter as follows:
Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries.
However, 'char_wb' does not work as I expected. For instance:
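from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The blue dog Blue",
    "Green the green cat",
    "The green mouse",
]
# Character 2-grams restricted to word boundaries
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
X = vectorizer.fit_transform(corpus)
# Newer scikit-learn releases replace this call with get_feature_names_out()
vectorizer.get_feature_names()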
[' b',
' c',
' d',
' g',
' m',
' t',
'at',
'bl',
'ca', ....
Notice entries such as ' b' that include a space. What gives?
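For reference, the features I expected look roughly like the output of the sketch below. It is plain Python, and the helper name char_ngrams_within_words is my own invention, not part of scikit-learn. It assumes words are separated by whitespace and that case should be folded.

```python
def char_ngrams_within_words(text, n=2):
    """Return character n-grams taken inside each word, with no space padding.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    grams = []
    # Lowercase and split on whitespace so that no n-gram can contain a space.
    for word in text.lower().split():
        # Slide a window of width n over the word; shorter words yield nothing.
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams


print(char_ngrams_within_words("The blue dog Blue"))
# → ['th', 'he', 'bl', 'lu', 'ue', 'do', 'og', 'bl', 'lu', 'ue']
```

Since the analyzer parameter also accepts a callable, I assume something like CountVectorizer(analyzer=char_ngrams_within_words) would count exactly these features, but I would still like to understand why 'char_wb' itself emits the padded ones.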