I have this code to calculate the similarity between texts with tf-idf:
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [doc1, doc2]
    tfidf = TfidfVectorizer().fit_transform(documents)
    pairwise_similarity = tfidf * tfidf.T
    print pairwise_similarity.A
The problem is that this code only accepts plain strings as input, and I want to preprocess the documents by removing stop words, stemming, and tokenizing them. So each entry would be a list of tokens rather than a string. This is the error I get if I call documents = [doc1, doc2] with tokenized documents:
    Traceback (most recent call last):
      File "C:\Users\tasos\Desktop\my thesis\beta\similarity.py", line 18, in <module>
        tfidf = TfidfVectorizer().fit_transform(documents)
      File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 1219, in fit_transform
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
      File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 780, in fit_transform
        vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
      File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 715, in _count_vocab
        for feature in analyze(doc):
      File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 229, in <lambda>
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 195, in <lambda>
        return lambda x: strip_accents(x.lower())
    AttributeError: 'unicode' object has no attribute 'apply_freq_filter'
Is there a way to change the code so that it accepts lists of tokens, or should I convert the tokenized documents back into strings?
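To illustrate, this is roughly what I have in mind for each option; tokens1 and tokens2 are placeholder names for two preprocessed token lists, and I have not verified this against scikit-learn 0.14.1:

    from sklearn.feature_extraction.text import TfidfVectorizer

    tokens1 = ['cat', 'sat', 'mat']   # example token lists standing in
    tokens2 = ['dog', 'sat', 'log']   # for my preprocessed documents

    # Option A: join each token list back into one string, so the
    # vectorizer can use its default string analyzer unchanged
    documents = [' '.join(tokens1), ' '.join(tokens2)]
    tfidf = TfidfVectorizer().fit_transform(documents)

    # Option B: pass a callable analyzer that returns the tokens as-is,
    # so the vectorizer skips its own preprocessing and tokenization
    tfidf = TfidfVectorizer(analyzer=lambda doc: doc).fit_transform([tokens1, tokens2])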
python scikit-learn tf-idf
Tasos