Let me give you another lesson I wrote. It answers your question, but it also explains why we do some of these things. I also tried to keep it concise.
So you have list_of_documents, which is just an array of strings, and another document, which is just a string. You need to find the document in list_of_documents that is most similar to document.
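To make this concrete, assume some toy data like the following (the texts themselves are made up, just so there is something to run):

list_of_documents = [
    'The nightly build crashed and the log analysis is still running.',
    'Cats purr when they are happy.',
    'Our analyzer reported that the build failed around midnight.',
]
document = 'Why did the nightly build fail? The analysis is not clear yet.'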
Combine them into one list: documents = list_of_documents + [document]
Let's start with the dependencies. It will be clear why we use each of them.
from nltk.corpus import stopwords
import string
from nltk.tokenize import wordpunct_tokenize as tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine
One approach that can be used is bag-of-words: we treat each word of a document independently of all the others and just throw them all together into one big bag. From one point of view this loses a lot of information (for example, how the words are connected), but from another point of view it keeps the model simple.
In English and in any other human language there are a lot of "useless" words such as "a", "the", "in", which are so common that they do not carry much meaning. They are called stop words, and it is a good idea to remove them. Another thing one can notice is that words like "analysis", "analyzer" and "analyses" are really similar. They share a common root, and all of them can be reduced to one word. This process is called stemming, and there exist different stemmers which differ in speed, aggressiveness and so on. So we will transform each of the documents into a list of word stems, with the stop words removed. We also discard all the punctuation.
porter = PorterStemmer()
stop_words = set(stopwords.words('english'))

modified_arr = [[porter.stem(i.lower()) for i in tokenize(d.translate(None, string.punctuation))
                 if i.lower() not in stop_words] for d in documents]
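As a quick sanity check, you can run the same pipeline on a single made-up sentence (the sentence is purely illustrative, and the snippet mirrors the Python 2 style str.translate call used above; the exact stems you get depend on the stemmer):

sample = 'The analyzer, the analysis, and our analyses!'
print([porter.stem(w.lower())
       for w in tokenize(sample.translate(None, string.punctuation))
       if w.lower() not in stop_words])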
So how does this bag of words help us? Imagine we have 3 bags: [a, b, c], [a, c, a] and [b, c, d]. We can convert them to vectors in the basis [a, b, c, d]. So we end up with the vectors [1, 1, 1, 0], [2, 0, 1, 0] and [0, 1, 1, 1]. The same happens with our documents (only the vectors will be way longer). Now we can see that we removed stop words and stemmed the rest precisely to reduce the dimension of these vectors. And here comes an interesting observation: longer documents will have far more positive elements than shorter ones, which is why it is nice to normalize the vector. This is called term frequency (TF); people also came to use additional information about how often a word appears in other documents, the inverse document frequency (IDF). Together they give the TF-IDF metric, which comes in a couple of flavors. This can be achieved with a single line in sklearn :-)
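If you want to reproduce those counts with sklearn itself, before any TF-IDF weighting, here is a minimal sketch (CountVectorizer is the count-only counterpart of TfidfVectorizer; the non-default token_pattern is needed only because the default one ignores single-letter tokens):

from sklearn.feature_extraction.text import CountVectorizer

bags = ['a b c', 'a c a', 'b c d']
counts = CountVectorizer(token_pattern=r'(?u)\b\w+\b').fit_transform(bags)
print(counts.toarray())   # rows [1 1 1 0], [2 0 1 0], [0 1 1 1] over the vocabulary a, b, c, d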
modified_doc = [' '.join(i) for i in modified_arr]  # convert our list of lists back into a list of strings, which is what the vectorizer expects
tf_idf = TfidfVectorizer().fit_transform(modified_doc)
In fact, the vectorizer allows you to do a lot of things, such as removing stop words and lowercasing. I did them as a separate step only because sklearn does not have non-English stop words, while nltk does.
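For English text you could even skip the manual preprocessing and let the vectorizer do it. A rough sketch (stop_words='english' and lowercase=True are real TfidfVectorizer options; there is no built-in stemming, so this is not exactly equivalent to the pipeline above):

# English-only shortcut: the vectorizer lowercases and drops stop words itself.
# Note: no stemming here, so the result will differ a little from the manual pipeline.
tf_idf_alt = TfidfVectorizer(stop_words='english', lowercase=True).fit_transform(documents)

The rest of the lesson keeps using the manually preprocessed tf_idf.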
So we have computed a vector for every document. The last step is to find which one is most similar to the last one. There are various ways to achieve this; one of them is Euclidean distance, which is not so great here: two documents about the same topic can have very different lengths, and length changes the magnitude of a count-based vector even when its direction stays the same. A better approach is cosine similarity (equivalently, cosine distance = 1 - cosine similarity, so the most similar document is the one with the smallest distance).
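A tiny numeric illustration of the difference (the numbers are made up; euclidean and cosine are the real scipy.spatial.distance functions):

import numpy as np
from scipy.spatial.distance import euclidean, cosine

v = np.array([1.0, 2.0, 0.0])
w = 2 * v                   # the "same document", just twice as long
print(euclidean(v, w))      # about 2.24 -- the vectors look quite different
print(cosine(v, w))         # essentially 0.0 -- same direction, a perfect match

With that in mind, we go through all the documents and compute the cosine distance between each of them and the last one: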
l = len(documents) - 1  # index of the last document, i.e. the one we are comparing against
minimum = (1, None)     # (best distance so far, index of the best document)
for i in xrange(l):
    minimum = min((cosine(tf_idf[i].todense(), tf_idf[l].todense()), i), minimum)
print minimum
Now minimum contains information about the best document: the first element is the cosine distance to it and the second is its index in list_of_documents.
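As a side note, if you prefer to avoid the explicit loop, sklearn can compute all the similarities at once on the sparse TF-IDF matrix (cosine_similarity is a real sklearn.metrics.pairwise function; l is still the index of the last row, i.e. our query):

from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(tf_idf[l], tf_idf[:l])   # shape (1, l): query vs. every other document
best = similarities.argmax()
print((1 - similarities[0, best], best))                  # the same (distance, index) pair as above

This avoids densifying each row with .todense() and is typically faster for larger collections.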