Scikit Learn - Extract tokens from a string delimiter using CountVectorizer

I have a list of strings. If a line contains the character "#", I want to extract the part of the line before it and tokenize only that part into words; i.e. for the string "first question # in stackoverflow" the expected tokens are "first" and "question".

If the string does not contain '#', then we return the tokens of the entire string.

To calculate the term matrix, I use the CountVectorizer from scikit.

Find my code below:

    class MyTokenizer(object):
        def __call__(self, s):
            if s.find('#') == -1:
                return s
            else:
                return s.split('#')[0]

    def FindKmeans():
        text = ["first ques # on stackoverflow", "please help"]
        vec = CountVectorizer(tokenizer=MyTokenizer(), analyzer='word')
        pos_vector = vec.fit_transform(text).toarray()
        print(vec.get_feature_names())

Output: [u' ', u'a', u'e', u'f', u'h', u'i', u'l', u'p', u'q', u'r', u's', u't', u'u']

Expected output: [u'first', u'ques', u'please', u'help']
3 answers

The problem lies in your tokenizer: you split the string into the part you want to keep and the part you want to discard, but you never split the kept part into words, so CountVectorizer falls back to treating each character as a token. Try the tokenizer below.

    class MyTokenizer(object):
        def __call__(self, s):
            # bare split() also drops the empty token that split(' ')
            # would leave behind from the trailing space before '#'
            if s.find('#') == -1:
                return s.split()
            else:
                return s.split('#')[0].split()

You can split on your separator (#) at most once and take the first part of the split.

    from sklearn.feature_extraction.text import CountVectorizer

    def tokenize(text):
        return [text.split('#', 1)[0].strip()]

    text = ["first ques # on stackoverflow", "please help"]
    vec = CountVectorizer(tokenizer=tokenize)
    data = vec.fit_transform(text).toarray()
    vocab = vec.get_feature_names()

    required_list = []
    for word in vocab:
        required_list.extend(word.split())

    print(required_list)  # ['first', 'ques', 'please', 'help']
  s.split('#', 1)[0] 

is your result; you do not need to check whether "#" exists, because split returns the whole string when the separator is absent.

