TF-IDF algorithm for Python

I have this code to calculate the similarity between texts with tf-idf:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [doc1, doc2]  # doc1 and doc2 are plain strings
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T
print(pairwise_similarity.A)

The problem is that this code accepts plain strings as input, and I want to prepare the documents by removing stop words, stemming, and tokenizing, so each document becomes a list of tokens. I get this error if I call documents = [doc1, doc2] with tokenized documents:

Traceback (most recent call last):
  File "C:\Users\tasos\Desktop\my thesis\beta\similarity.py", line 18, in <module>
    tfidf = TfidfVectorizer().fit_transform(documents)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 1219, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 715, in _count_vocab
    for feature in analyze(doc):
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 229, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 195, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
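(The preprocessing itself is not shown above; a hypothetical sketch of one way to produce such tokenized documents with NLTK, just to illustrate the input format, might be:)

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Tokenize, drop stop words and punctuation, and stem each remaining token
    return [stemmer.stem(token) for token in word_tokenize(text.lower())
            if token.isalnum() and token not in stop_words]

doc1 = preprocess("The cat sat on the mat.")
doc2 = preprocess("A dog sat on the log.")
documents = [doc1, doc2]  # each document is now a list of tokens, not a string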

Is there a way to change the code so that it accepts a list, or should I instead convert the tokenized documents back to strings?

python scikit-learn tf-idf
1 answer

Try skipping the lowercasing preprocessing step and supplying a "no-op" tokenizer that passes each token list through unchanged:

 tfidf = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(documents) 
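
For example, a minimal end-to-end sketch with pre-tokenized input (the token lists here are hypothetical stand-ins for your preprocessed documents):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-tokenized documents (already stop-word-filtered and stemmed)
documents = [["cat", "sat", "mat"],
             ["dog", "sat", "log"]]

# The identity tokenizer returns each token list unchanged, and
# lowercase=False keeps the vectorizer from calling .lower() on a list
tfidf = TfidfVectorizer(tokenizer=lambda doc: doc,
                        lowercase=False).fit_transform(documents)

pairwise_similarity = tfidf * tfidf.T
print(pairwise_similarity.A)

With the default settings the rows of tfidf are L2-normalized, so tfidf * tfidf.T directly gives the pairwise cosine similarities.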

You should also check the other parameters, such as stop_words, to avoid duplicating preprocessing that you have already done yourself.
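
Alternatively, as the question itself suggests, you can join each token list back into a whitespace-separated string and keep the default analyzer. A minimal sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_documents = [["cat", "sat", "mat"],   # hypothetical preprocessed docs
                       ["dog", "sat", "log"]]

# Rebuild plain strings so the default analyzer can handle them
documents = [" ".join(tokens) for tokens in tokenized_documents]

tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T
print(pairwise_similarity.A)

Note that the default analyzer will lowercase and re-tokenize the joined string (the default token_pattern drops single-character tokens), so this only works cleanly if your tokens survive that second pass.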
