Scikit Learn - Extract tokens from a string delimiter using CountVectorizer

I have a list of strings. If a line contains the character "#", I want to extract the part of the line before it and tokenize only that part into words; i.e. for the string "first question # in stackoverflow" the expected tokens are "first" and "question".

If the string does not contain '#', then we return the tokens of the entire string.

To calculate the term matrix, I use the CountVectorizer from scikit.

Find my code below:

    class MyTokenizer(object):
        def __call__(self, s):
            if s.find('#') == -1:
                return s
            else:
                return s.split('#')[0]

    def FindKmeans():
        text = ["first ques # on stackoverflow", "please help"]
        vec = CountVectorizer(tokenizer=MyTokenizer(), analyzer='word')
        pos_vector = vec.fit_transform(text).toarray()
        print(vec.get_feature_names())

Output: [u' ', u'a', u'e', u'f', u'h', u'i', u'l', u'p', u'q', u'r', u's', u't', u'u']

Expected output: [u'first', u'ques', u'please', u'help']
3 answers

The problem lies in your tokenizer: you split the string into the part you want to keep and the part you want to discard, but you never split the kept part into words, so CountVectorizer falls back to treating each character as a token. Try the tokenizer below.

    class MyTokenizer(object):
        def __call__(self, s):
            # bare split() also drops the empty token that split(' ')
            # would leave behind from the trailing space before '#'
            if s.find('#') == -1:
                return s.split()
            else:
                return s.split('#')[0].split()

You can split on your separator (#) at most once and take the first part of the split.

    from sklearn.feature_extraction.text import CountVectorizer

    def tokenize(text):
        return [text.split('#', 1)[0].strip()]

    text = ["first ques # on stackoverflow", "please help"]
    vec = CountVectorizer(tokenizer=tokenize)
    data = vec.fit_transform(text).toarray()
    vocab = vec.get_feature_names()

    required_list = []
    for word in vocab:
        required_list.extend(word.split())

    print(required_list)  # ['first', 'ques', 'please', 'help']
  s.split('#', 1)[0] 

is your result; you do not need to check whether "#" exists, because split returns the whole string when the separator is absent.

