Let me give you another lesson I wrote. It answers your question, but it also explains why we do some of these things. I also tried to keep it concise.
So you have list_of_documents, which is just an array of strings, and another document, which is just a string. You need to find the document in list_of_documents that is most similar to document.
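To make this concrete, assume some toy data like the following (the texts themselves are made up, just so there is something to run):

list_of_documents = [
    'The nightly build crashed and the log analysis is still running.',
    'Cats purr when they are happy.',
    'Our analyzer reported that the build failed around midnight.',
]
document = 'Why did the nightly build fail? The analysis is not clear yet.'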
Combine them into one list: documents = list_of_documents + [document]
Let's start with the dependencies. It will be clear why we use each of them.
from nltk.corpus import stopwords
import string
from nltk.tokenize import wordpunct_tokenize as tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine
One approach that can be used is bag-of-words: we treat each word of a document independently of all the others and just throw them all together into one big bag. From one point of view this loses a lot of information (for example, how the words are connected), but from another point of view it keeps the model simple.
In English and in any other human language there are a lot of "useless" words such as "a", "the", "in", which are so common that they do not carry much meaning. They are called stop words, and it is a good idea to remove them. Another thing one can notice is that words like "analysis", "analyzer" and "analyses" are really similar. They share a common root, and all of them can be reduced to one word. This process is called stemming, and there exist different stemmers which differ in speed, aggressiveness and so on. So we will transform each of the documents into a list of word stems, with the stop words removed. We also discard all the punctuation.
porter = PorterStemmer()
stop_words = set(stopwords.words('english'))

modified_arr = [[porter.stem(i.lower()) for i in tokenize(d.translate(None, string.punctuation))
                 if i.lower() not in stop_words] for d in documents]
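As a quick sanity check, you can run the same pipeline on a single made-up sentence (the sentence is purely illustrative, and the snippet mirrors the Python 2 style str.translate call used above; the exact stems you get depend on the stemmer):

sample = 'The analyzer, the analysis, and our analyses!'
print([porter.stem(w.lower())
       for w in tokenize(sample.translate(None, string.punctuation))
       if w.lower() not in stop_words])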
So how does this bag of words help us? Imagine we have 3 bags: [a, b, c], [a, c, a] and [b, c, d]. We can convert them to vectors in the basis [a, b, c, d]. So we end up with the vectors [1, 1, 1, 0], [2, 0, 1, 0] and [0, 1, 1, 1]. The same happens with our documents (only the vectors will be way longer). Now we can see that we removed stop words and stemmed the rest precisely to reduce the dimension of these vectors. And here comes an interesting observation: longer documents will have far more positive elements than shorter ones, which is why it is nice to normalize the vector. This is called term frequency (TF); people also came to use additional information about how often a word appears in other documents, the inverse document frequency (IDF). Together they give the TF-IDF metric, which comes in a couple of flavors. This can be achieved with a single line in sklearn :-)
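If you want to reproduce those counts with sklearn itself, before any TF-IDF weighting, here is a minimal sketch (CountVectorizer is the count-only counterpart of TfidfVectorizer; the non-default token_pattern is needed only because the default one ignores single-letter tokens):

from sklearn.feature_extraction.text import CountVectorizer

bags = ['a b c', 'a c a', 'b c d']
counts = CountVectorizer(token_pattern=r'(?u)\b\w+\b').fit_transform(bags)
print(counts.toarray())   # rows [1 1 1 0], [2 0 1 0], [0 1 1 1] over the vocabulary a, b, c, d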
modified_doc = [' '.join(i) for i in modified_arr]  # convert our list of lists back into a list of strings, which is what the vectorizer expects
tf_idf = TfidfVectorizer().fit_transform(modified_doc)
In fact, the vectorizer allows you to do a lot of things, such as removing stop words and lowercasing. I did them as a separate step only because sklearn does not have non-English stop words, while nltk does.
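For English text you could even skip the manual preprocessing and let the vectorizer do it. A rough sketch (stop_words='english' and lowercase=True are real TfidfVectorizer options; there is no built-in stemming, so this is not exactly equivalent to the pipeline above):

# English-only shortcut: the vectorizer lowercases and drops stop words itself.
# Note: no stemming here, so the result will differ a little from the manual pipeline.
tf_idf_alt = TfidfVectorizer(stop_words='english', lowercase=True).fit_transform(documents)

The rest of the lesson keeps using the manually preprocessed tf_idf.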
So we have computed a vector for every document. The last step is to find which one is most similar to the last one. There are various ways to achieve this; one of them is Euclidean distance, which is not so great here: two documents about the same topic can have very different lengths, and length changes the magnitude of a count-based vector even when its direction stays the same. A better approach is cosine similarity (equivalently, cosine distance = 1 - cosine similarity, so the most similar document is the one with the smallest distance).
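A tiny numeric illustration of the difference (the numbers are made up; euclidean and cosine are the real scipy.spatial.distance functions):

import numpy as np
from scipy.spatial.distance import euclidean, cosine

v = np.array([1.0, 2.0, 0.0])
w = 2 * v                   # the "same document", just twice as long
print(euclidean(v, w))      # about 2.24 -- the vectors look quite different
print(cosine(v, w))         # essentially 0.0 -- same direction, a perfect match

With that in mind, we go through all the documents and compute the cosine distance between each of them and the last one: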
l = len(documents) - 1  # index of the last document, i.e. the one we are comparing against
minimum = (1, None)     # (best distance so far, index of the best document)
for i in xrange(l):
    minimum = min((cosine(tf_idf[i].todense(), tf_idf[l].todense()), i), minimum)
print minimum
Now minimum contains information about the best document: the first element is the cosine distance to it and the second is its index in list_of_documents.
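As a side note, if you prefer to avoid the explicit loop, sklearn can compute all the similarities at once on the sparse TF-IDF matrix (cosine_similarity is a real sklearn.metrics.pairwise function; l is still the index of the last row, i.e. our query):

from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(tf_idf[l], tf_idf[:l])   # shape (1, l): query vs. every other document
best = similarities.argmax()
print((1 - similarities[0, best], best))                  # the same (distance, index) pair as above

This avoids densifying each row with .todense() and is typically faster for larger collections.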