I read about the TfidfVectorizer implementation of scikit-learn, I do not understand what is the result of this method, for example:
new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball'] new_term_freq_matrix = tfidf_vectorizer.transform(new_docs) print tfidf_vectorizer.vocabulary_ print new_term_freq_matrix.todense()
exit:
{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2} [[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ] [ 0.62276601 0. 0. 0.62276601 0. 0. 0. 0.4736296 0. 0. 0. ]]
What? (ex: u'me ': 8):
{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
Is it a matrix or just a vector ?, I canβt understand what the conclusion tells me:
[[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ] [ 0.62276601 0. 0. 0.62276601 0. 0. 0. 0.4736296 0. 0. 0. ]]
Can someone explain these outputs in more detail to me?
Thanks!
scikit-learn machine-learning nlp feature-extraction document-classification
anon
source share