CountVectorizer (analyzer = 'char_wb') does not work as expected

I am trying to use scikit-learn's CountVectorizer to count character 2-grams, ignoring spaces. The docs mention a parameter, analyzer, which is described as follows:

Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries.

However, "char_wb" does not work as I expected. For instance:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The blue dog Blue",
    "Green the green cat",
    "The green mouse",
]

# CountVectorizer character 2-grams with word boundaries
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names()  # in scikit-learn >= 1.0, use get_feature_names_out()
[' b',
 ' c',
 ' d',
 ' g',
 ' m',
 ' t',
 'at',
 'bl',
 'ca', ....

Notice the features such as ' b' that include a space. What gives?

1 answer

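I think this is a longstanding inaccuracy in the documentation, which you are welcome to help fix. It would be more accurate to say:

char_wb creates character n-grams, but does not generate n-grams that cross word boundaries.

The behavior appears to have been introduced in this commit, to ensure exactly that; see the contributor's comment. It looks particularly awkward when you compare the bigrams with the output of analyzer='char', but if you increase to trigrams, you will see that whitespace can begin or end an n-gram but cannot appear in the middle. This helps mark the word-initial or word-final nature of a feature without capturing noisy cross-word character n-grams. It also ensures, unlike the earlier behavior, that all extracted n-grams have the same length!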
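To see the difference between 'char' and 'char_wb' directly, here is a minimal sketch that runs both analyzers on the first sentence from the question, for bigrams and for trigrams. It relies on build_analyzer(), which returns the tokenizing function without fitting anything; the helper name ngram_set is made up for this illustration and is not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "The blue dog Blue"  # first document from the question's corpus


def ngram_set(analyzer, n):
    """Hypothetical helper: the distinct n-grams an analyzer yields for `text`."""
    vec = CountVectorizer(analyzer=analyzer, ngram_range=(n, n))
    analyze = vec.build_analyzer()  # lowercases, then splits into n-grams
    return set(analyze(text))


char_bigrams = ngram_set("char", 2)
wb_bigrams = ngram_set("char_wb", 2)
char_trigrams = ngram_set("char", 3)
wb_trigrams = ngram_set("char_wb", 3)

# Bigrams: both analyzers emit features with a leading or trailing space,
# so the two outputs look almost the same.
print(sorted(char_bigrams))
print(sorted(wb_bigrams))

# Trigrams: 'char' slides across the whole string, so it produces n-grams
# with a space in the middle; 'char_wb' pads each word with one space on
# each side and never crosses into the next word.
print(sorted(g for g in char_trigrams if " " in g[1:-1]))
print(sorted(g for g in wb_trigrams if " " in g[1:-1]))
```

With bigrams a space can only ever sit at an edge, which is why the two analyzers are hard to tell apart in the question's output. The last two print calls make the distinction visible once n is 3.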
