Scikit-learn TfidfVectorizer: what do the output values mean?

I have read about scikit-learn's TfidfVectorizer implementation, but I do not understand what the result of this method is. For example:

 new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']
 new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
 print tfidf_vectorizer.vocabulary_
 print new_term_freq_matrix.todense()

Output:

 {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
 [[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.68091856  0.          0.          0.51785612  0.51785612  0.          0.          0.          0.          0.        ]
  [ 0.62276601  0.          0.          0.62276601  0.          0.          0.          0.4736296   0.          0.          0.        ]]

What does this mean (e.g. u'me': 8)?

 {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2} 

Is it a matrix or just a vector? I can't understand what this output tells me:

 [[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.68091856  0.          0.          0.51785612  0.51785612  0.          0.          0.          0.          0.        ]
  [ 0.62276601  0.          0.          0.62276601  0.          0.          0.          0.4736296   0.          0.          0.        ]]

Can someone explain these outputs in more detail to me?

Thanks!

+13
scikit-learn machine-learning nlp feature-extraction document-classification
3 answers

TfidfVectorizer converts a collection of text documents into a matrix of TF-IDF features that can be used as input to an estimator.

vocabulary_ is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets its own feature index.

What does this mean (e.g. u'me': 8)?

It tells you that the token "me" is represented as feature number 8, i.e. column 8 of the output matrix.
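
For illustration, here is a minimal sketch of how that dictionary comes about. The question does not show the documents the vectorizer was fitted on, so the train_docs below are only a guess (they happen to reproduce the same vocabulary):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical training corpus -- not shown in the question; chosen so that
    # the learned vocabulary matches the one printed above.
    train_docs = ['Julie loves me more than Linda loves me',
                  'Jane likes me more than Julie loves me',
                  'He likes basketball more than baseball']

    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectorizer.fit(train_docs)           # learns the vocabulary and the idf weights

    print(tfidf_vectorizer.vocabulary_)        # token -> column index
    print(tfidf_vectorizer.vocabulary_['me'])  # 8: 'me' is feature (column) number 8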

Is it a matrix or just a vector?

Each sentence is a vector, and the sentences you transformed form a matrix with 3 vectors (rows). In each vector the numbers are the tf-idf weights of the features. For example, 'julie': 4 tells you that whenever 'julie' appears in a sentence, a non-zero tf-idf weight shows up at index 4, as you can see in the 2nd vector:

[ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]

The 5th element (index 4) is 0.51785612, the tf-idf score for 'julie'. For more information on how tf-idf is calculated, see: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
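
To make the row/column indexing concrete, here is a small sketch of how you could read a single weight out of the transformed matrix. It reuses the (assumed) tfidf_vectorizer from the sketch above and the new_docs from the question:

    new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']
    new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)   # sparse matrix: 3 documents x 11 features

    col = tfidf_vectorizer.vocabulary_['julie']      # column index 4
    print(new_term_freq_matrix.shape)                # (3, 11)
    print(new_term_freq_matrix[1, col])              # tf-idf of 'julie' in the 2nd sentence, ~0.5179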

+11

So tf-idf builds its vocabulary from the entire set of documents, which is what you see in the first line of the output (I sorted it for easier reading):

 {u'baseball': 0, u'basketball': 1, u'he': 2, u'jane': 3, u'julie': 4, u'likes': 5, u'linda': 6, u'loves': 7, u'me': 8, u'more': 9, u'than': 10}

When a document is then transformed to get its tf-idf representation, for example the document:

He watches basketball and baseball

its output is

[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0. 0. 0. 0. 0. ]

which lines up with the sorted vocabulary:

[baseball basketball he jane julie likes linda loves me more than]

Since our document contains only three of the words from the learned vocabulary (baseball, basketball, he), the document's vector has non-zero tf-idf values only for those three words, placed at their positions in the sorted vocabulary.

tf-idf is used for document classification and for ranking in search engines. tf: term frequency (how often a word occurs in a document), idf: inverse document frequency (how important a word is, based on how rarely it occurs across the documents).
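
If you want to check one of the numbers by hand, here is a rough sketch of the computation scikit-learn performs with its default settings (smooth_idf=True, norm='l2'), using the first document of the question and the guessed 3-document training corpus from above:

    import numpy as np

    # Document: 'He watches basketball and baseball'
    # Terms that are in the vocabulary, with their raw counts in this document:
    tf = np.array([1.0, 1.0, 1.0])               # baseball, basketball, he

    # Each of these terms occurs in exactly 1 of the 3 (assumed) training documents.
    n_docs = 3
    df = np.array([1.0, 1.0, 1.0])
    idf = np.log((1 + n_docs) / (1 + df)) + 1    # smooth_idf=True formula

    weights = tf * idf
    weights = weights / np.linalg.norm(weights)  # L2 normalisation (norm='l2')
    print(weights)                               # [0.57735027 0.57735027 0.57735027]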

+4

The method takes into account that not all words should be weighted equally: it uses the weights to highlight the words that are most distinctive for a document and therefore best characterize it.

    new_docs = ['basketball baseball', 'basketball baseball', 'basketball baseball']
    new_term_freq_matrix = vectorizer.fit_transform(new_docs)
    print(vectorizer.vocabulary_)
    print(new_term_freq_matrix.todense())

    {'basketball': 1, 'baseball': 0}
    [[ 0.70710678  0.70710678]
     [ 0.70710678  0.70710678]
     [ 0.70710678  0.70710678]]

    new_docs = ['basketball baseball', 'basketball basketball', 'basketball basketball']
    new_term_freq_matrix = vectorizer.fit_transform(new_docs)
    print(vectorizer.vocabulary_)
    print(new_term_freq_matrix.todense())

    {'basketball': 1, 'baseball': 0}
    [[ 0.861037    0.50854232]
     [ 0.          1.        ]
     [ 0.          1.        ]]

    new_docs = ['basketball basketball baseball', 'basketball basketball', 'basketball basketball']
    new_term_freq_matrix = vectorizer.fit_transform(new_docs)
    print(vectorizer.vocabulary_)
    print(new_term_freq_matrix.todense())

    {'basketball': 1, 'baseball': 0}
    [[ 0.64612892  0.76322829]
     [ 0.          1.        ]
     [ 0.          1.        ]]
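
One way to see why 'basketball' loses weight once it appears in every document is to look at the learned idf values, which TfidfVectorizer exposes as idf_. A short sketch, reusing the third set of documents above:

    from sklearn.feature_extraction.text import TfidfVectorizer

    new_docs = ['basketball basketball baseball', 'basketball basketball', 'basketball basketball']
    vectorizer = TfidfVectorizer()
    vectorizer.fit(new_docs)

    # 'basketball' occurs in every document, so it gets the minimum idf (1.0);
    # 'baseball' occurs in only one document, so it gets a larger idf (~1.69).
    feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    print(dict(zip(feature_names, vectorizer.idf_)))
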
0
