Use scikit learn tfidf vectorizer starting with counts data frame

Question


I have a pandas data frame containing word counts for a number of documents. Can I apply sklearn.feature_extraction.text.TfidfVectorizer to it to get the tf-idf weighted term-document matrix?

import pandas as pd

a = [1,2,3,4]
b = [1,3,4,6]
c = [3,4,6,1]

df = pd.DataFrame([a,b,c])

How can I get the tf-idf version of the counts in df?


python scikit-learn nlp tf-idf

Adj Feb 14 '15 at 0:20


1 answer

Jab · Answer 1 · 2015-02-16T02:57:08+0000

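Yes, but use TfidfTransformer rather than TfidfVectorizer: the vectorizer expects raw text, while the transformer takes a matrix of counts. Like this: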

from sklearn.feature_extraction.text import TfidfTransformer

# TfidfTransformer works on a matrix of counts (rows = documents, columns = words)
tfidf = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
data = tfidf.fit_transform(df.values)

This returns a sparse matrix of tf-idf values. You can convert it to a dense one and put it back into a data frame as follows:

pd.DataFrame(data.toarray())
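For reference, here is a minimal end-to-end sketch of the same approach that keeps the row and column labels and checks the result by hand. The labels (doc0, w0, ...) and the variable names are made up for illustration. The manual check assumes the smoothed idf formula ln((1 + n_docs) / (1 + doc_freq)) + 1 that scikit-learn documents for smooth_idf=True.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# Same counts as in the question; the row and column labels are hypothetical.
counts = pd.DataFrame(
    [[1, 2, 3, 4], [1, 3, 4, 6], [3, 4, 6, 1]],
    index=["doc0", "doc1", "doc2"],
    columns=["w0", "w1", "w2", "w3"],
)

tfidf = TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=False)
weights = tfidf.fit_transform(counts.values)  # scipy sparse matrix

# Wrap the dense result in a DataFrame that reuses the original labels.
tfidf_df = pd.DataFrame(weights.toarray(), index=counts.index, columns=counts.columns)

# Manual check: smoothed idf, then L2-normalise each row (document).
n_docs = counts.shape[0]
doc_freq = (counts.values > 0).sum(axis=0)  # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
raw = counts.values * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(tfidf_df)
print(np.allclose(tfidf_df.values, manual))
```

Note that in this particular example every word occurs in every document, so each idf weight is 1 and the output is simply the L2-normalised counts. With sparser data the idf term would down-weight words that appear in many documents.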
