I have a list of strings. If a string contains the character "#", I want to keep only the part before it and get the word tokens from that part. For example, for the string "first question # in stackoverflow" the expected tokens are "first" and "question".
If the string does not contain "#", the tokens of the entire string should be returned.
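To make the rule concrete, here is a rough sketch of the behavior I am after, in plain Python and without scikit-learn. The helper name `tokens_before_hash` is made up for illustration, and splitting on whitespace is my assumption about what counts as a word:

```python
def tokens_before_hash(line):
    # Keep only the text before the first '#'. When there is no '#',
    # partition returns the whole string as the head.
    head, _, _ = line.partition('#')
    # Split on whitespace so the result is a list of words, not a string.
    return head.split()


print(tokens_before_hash("first question # in stackoverflow"))  # ['first', 'question']
print(tokens_before_hash("please help"))                        # ['please', 'help']
```

The important part is that the function returns a list of words in both cases.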
To compute the term matrix, I use CountVectorizer from scikit-learn.
My code is below:
    from sklearn.feature_extraction.text import CountVectorizer

    class MyTokenizer(object):
        def __call__(self, s):
            if s.find('#') == -1:
                return s
            else:
                return s.split('#')[0]

    def FindKmeans():
        text = ["first ques # on stackoverflow", "please help"]
        vec = CountVectorizer(tokenizer=MyTokenizer(), analyzer='word')
        pos_vector = vec.fit_transform(text).toarray()
        print(vec.get_feature_names())

Output:

    [u' ', u'a', u'e', u'f', u'h', u'i', u'l', u'p', u'q', u'r', u's', u't', u'u']

Expected output:

    [u'first', u'ques', u'please', u'help']