How to calculate sentence similarity using the word2vec model of gensim with Python

According to Gensim Word2Vec, I can use the word2vec model in the gensim package to calculate the similarity between two words.

e.g.

    trained_model.similarity('woman', 'man')
    0.73723527

However, the word2vec model cannot predict the similarity of sentences. I found the LSI model with sentence similarity in gensim, but it does not seem that it can be combined with the word2vec model. The corpus of each sentence I have is not very long (shorter than 10 words). So, are there any simple ways to achieve this?

+88
python gensim word2vec
Mar 02 '14 at 16:04
12 answers

This is actually a pretty complicated problem that you are asking about. Calculating sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (for example, "he went to the store yesterday" and "yesterday, he went to the store"), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding statistical co-occurrences / relationships in lots of real textual examples, etc.

The simplest thing you could try - though I don't know how well it would work, and it would certainly not give you optimal results - would be to first remove all the "stop words" (words like "the", "an", etc. that don't add much meaning to the sentence), then run word2vec on the words in both sentences, sum up the vectors in the one sentence, sum up the vectors in the other sentence, and then find the difference between the sums. By summing them up instead of doing a word-wise difference, you at least won't be subject to word order. That being said, this will fail in lots of ways and is not a good solution by any means (although good solutions to this problem almost always involve some amount of NLP, machine learning, and other cleverness).
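For illustration, here is a minimal sketch of that bag-of-vectors baseline. It assumes NLTK's English stop-word list is available, gensim 3.x attribute names, and that trained_model is the word2vec model from the question; the names are illustrative:

    import numpy as np
    from nltk.corpus import stopwords  # assumes nltk and its stopwords corpus are installed

    STOP_WORDS = set(stopwords.words('english'))

    def bag_of_vectors(sentence, wv):
        # drop stop words, then sum the word2vec vectors of the remaining words
        words = [w for w in sentence.lower().split()
                 if w not in STOP_WORDS and w in wv.vocab]
        return np.sum([wv[w] for w in words], axis=0)

    v1 = bag_of_vectors('he went to the store yesterday', trained_model.wv)
    v2 = bag_of_vectors('yesterday he went to the store', trained_model.wv)

    # compare the two summed vectors, e.g. with cosine similarity
    sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))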

So the short answer is: no, there is no easy way to do this (at least not to do it well).

+66
Mar 02 '14 at 16:18

Since you are using gensim, you should probably use its doc2vec implementation. doc2vec is an extension of word2vec to the phrase, sentence, and document level. It is a fairly simple extension, described here:

http://cs.stanford.edu/~quocle/paragraph_vector.pdf

Gensim is nice because it is intuitive, fast and flexible. What is great is that you can grab the pre-trained word embeddings from the official word2vec page, and the syn0 layer of gensim's Doc2Vec model is exposed so that you can seed the word embeddings with these high-quality vectors!

GoogleNews-vectors-negative300.bin.gz

I think gensim is by far the easiest (and so far the best for me) tool for embedding sentences in vector space.
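As a rough sketch of what that looks like (parameter values are arbitrary, attribute names follow gensim 3.x or later, and the corpus here is a toy stand-in for real training data):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from scipy import spatial

    # toy corpus; in practice you would train on a large collection of documents
    corpus = ['this is a sentence',
              'this is also a sentence',
              'something completely different']
    documents = [TaggedDocument(words=text.split(), tags=[i])
                 for i, text in enumerate(corpus)]

    model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, epochs=40)

    # infer vectors for (possibly unseen) sentences and compare them
    v1 = model.infer_vector('this is a sentence'.split())
    v2 = model.infer_vector('this is also a sentence'.split())
    print(1 - spatial.distance.cosine(v1, v2))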

There are other sentence-to-vector techniques than the one proposed in the paper by Le and Mikolov above. Socher and Manning from Stanford are certainly two of the most famous researchers working in this field. Their work is based on the principle of compositionality - the semantics of a sentence come from:

 1. the semantics of the words
 2. the rules for how these words interact and combine into phrases

They have proposed a few such models (getting increasingly complex) for how to use compositionality to build sentence-level representations.

2011 - Unfolding Recursive Autoencoder (comparatively simple; start here if interested)

2012 - matrix-vector neural network

2013 - neural tensor network

2015 - Tree LSTM

His papers are all available at socher.org. Some of these models are available as code, but I would still recommend gensim's doc2vec. For one thing, the 2011 URAE is not particularly powerful. In addition, it comes pretrained with weights suited for paraphrasing news-style data. The code he provides does not allow you to retrain the network. You also cannot swap in different word vectors, so you are stuck with the 2011 pre-word2vec embeddings from Turian. These vectors are certainly not on the level of word2vec's or GloVe's.

I have not worked with the Tree LSTM yet, but it seems very promising!

tl;dr Yes, use gensim's doc2vec. But other methods do exist!

+62
Jul 14 '15 at 20:54

If you use word2vec, you need to calculate the average vector for all words in each sentence / document and use the cosine similarity between the vectors:

    import numpy as np
    from scipy import spatial

    index2word_set = set(model.wv.index2word)

    def avg_feature_vector(sentence, model, num_features, index2word_set):
        words = sentence.split()
        feature_vec = np.zeros((num_features, ), dtype='float32')
        n_words = 0
        for word in words:
            if word in index2word_set:
                n_words += 1
                feature_vec = np.add(feature_vec, model[word])
        if (n_words > 0):
            feature_vec = np.divide(feature_vec, n_words)
        return feature_vec

Calculate the similarity:

    s1_afv = avg_feature_vector('this is a sentence', model=model,
                                num_features=300, index2word_set=index2word_set)
    s2_afv = avg_feature_vector('this is also sentence', model=model,
                                num_features=300, index2word_set=index2word_set)
    sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
    print(sim)
    # 0.915479828613
+27
Jan 29 '16 at 19:09

You can use the Word Mover's Distance (WMD) algorithm; here is a simple description of WMD.

    import gensim

    # load a word2vec model; the GoogleNews vectors are used here
    model = gensim.models.KeyedVectors.load_word2vec_format(
        '../GoogleNews-vectors-negative300.bin', binary=True)

    # two sample sentences, tokenized into lists of words
    s1 = 'the first sentence'.split()
    s2 = 'the second text'.split()

    # calculate the distance between the two sentences using the WMD algorithm
    distance = model.wmdistance(s1, s2)
    print('distance = %.3f' % distance)

Ps: If you encounter an error importing the pyemd library, you can install it using the following command:

 pip install pyemd 
+18
Oct 02 '17 at 11:25

Once you compute the sum of the two sets of word vectors, you should take the cosine between the vectors, not the diff. The cosine can be computed by taking the dot product of the two vectors after normalizing them. Thus, the word count is not a factor.
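In other words, something like this (a sketch assuming a gensim KeyedVectors object wv with gensim 3.x attribute names):

    import numpy as np

    def sum_of_vectors(sentence, wv):
        # sum the vectors of the words we actually have embeddings for
        words = [w for w in sentence.lower().split() if w in wv.vocab]
        return np.sum([wv[w] for w in words], axis=0)

    def cosine(u, v):
        # dot product of the two normalized vectors; sentence length drops out
        return np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))

    sim = cosine(sum_of_vectors('this is a sentence', wv),
                 sum_of_vectors('this is also a sentence', wv))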

+16
May 05 '14 at 1:35

I am using the following method and it works well. First you need to run a POS tagger and then filter your sentence to get rid of the stop words (determiners, conjunctions, ...). I recommend TextBlob's APTagger. Then you compute a word2vec vector by taking the mean of each word vector in the sentence. The n_similarity method in Gensim word2vec does exactly that, by letting you pass two sets of words to compare.
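A sketch of that pipeline, using NLTK's POS tagger instead of TextBlob's APTagger (it assumes the nltk tokenizer and tagger data have been downloaded; the tag whitelist and the model variable are illustrative):

    from nltk import pos_tag, word_tokenize  # assumes nltk data is downloaded

    KEEP_TAGS = ('NN', 'VB', 'JJ', 'RB')  # nouns, verbs, adjectives, adverbs

    def content_words(sentence):
        # drop determiners, conjunctions, etc. and keep only content words
        tagged = pos_tag(word_tokenize(sentence.lower()))
        return [word for word, tag in tagged if tag.startswith(KEEP_TAGS)]

    # n_similarity averages the vectors of each word set and takes the cosine
    sim = model.wv.n_similarity(content_words('This room is dirty'),
                                content_words('dirty and disgusting room'))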

+8
Oct 08 '14 at 14:29

I would like to update the existing solution to help people who are going to calculate the semantic similarity of sentences.

Step 1:

Download the appropriate model with gensim and compute the word vectors for the words in the sentence, storing them as a word list.

Step 2: Computing the sentence vector

The calculation of semantic similarity between sentences was difficult until recently, when a paper called "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" was published. It suggests a simple approach: compute the weighted average of the word vectors in the sentence and then remove the projections of the average vectors onto their first principal component. Here the weight of a word w is a / (a + p(w)), where a is a parameter and p(w) the (estimated) word frequency, called the smooth inverse frequency. This method performs significantly better.

Simple code for calculating the sentence vector using SIF (smooth inverse frequency), the method proposed in the paper, has been given here.

Step 3: Using sklearn's cosine_similarity, load the two vectors for the sentences and compute the similarity.
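A condensed sketch of steps 1-3, assuming a gensim KeyedVectors object wv and a dict word_freq of estimated word frequencies (both are assumptions, not part of the paper's released code); the weight a / (a + p(w)) and the principal-component removal follow the SIF recipe described above:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    def sif_embeddings(sentences, wv, word_freq, a=1e-3):
        vecs = []
        for sentence in sentences:
            words = [w for w in sentence.split() if w in wv.vocab]
            # weighted average of word vectors, weight = a / (a + p(w))
            weights = [a / (a + word_freq.get(w, 0.0)) for w in words]
            vecs.append(np.average([wv[w] for w in words], axis=0, weights=weights))
        vecs = np.array(vecs)
        # remove the projection onto the first principal component
        svd = TruncatedSVD(n_components=1, n_iter=7, random_state=0)
        svd.fit(vecs)
        pc = svd.components_
        return vecs - vecs.dot(pc.T) * pc

    emb = sif_embeddings(['this is a sentence', 'this is also a sentence'], wv, word_freq)
    print(cosine_similarity([emb[0]], [emb[1]])[0][0])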

This is the simplest and most effective method for calculating the similarity of sentences.

+8
Jun 06 '17 at 7:02

There are Word2Vec extensions designed to solve the problem of comparing longer pieces of text, such as phrases or sentences. One of them is para2vec or doc2vec.

"Distributed Submissions of Proposals and Documents" http://cs.stanford.edu/~quocle/paragraph_vector.pdf

http://rare-technologies.com/doc2vec-tutorial/

+6
Aug 24 '15 at 22:56

I tried the methods provided by the previous answers. It works, but its main drawback is that longer sentences will have larger similarity (to calculate the similarity I use the cosine score of the two mean embeddings of any two sentences), since the more words there are, the more positive semantic effects are added to the sentence.

I thought I should change my mind and use sentence embeddings instead, as studied in this paper and this one.

+3
Nov 18 '16 at 7:07

There is a function in the documentation that takes lists of words and compares their similarity.

    s1 = 'This room is dirty'
    s2 = 'dirty and disgusting room'
    distance = model.wv.n_similarity(s1.lower().split(), s2.lower().split())
+2
May 22 '18 at 14:34

The Facebook research team has released a new solution called InferSent. The results and code are published on GitHub; check their repository. It is pretty cool. I plan to use it. https://github.com/facebookresearch/InferSent

Their paper is https://arxiv.org/abs/1705.02364 Abstract: Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features that can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.
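A usage sketch, paraphrasing the repository's README: the models.py module, the pickle and vector file names, and the parameter dictionary all come from that README as an assumption and may differ between versions of the repository:

    import numpy as np
    import torch
    from models import InferSent  # models.py from the InferSent repository

    params = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
              'pool_type': 'max', 'dpout_model': 0.0, 'version': 2}
    model = InferSent(params)
    model.load_state_dict(torch.load('infersent2.pkl'))  # pretrained encoder from the repo
    model.set_w2v_path('crawl-300d-2M.vec')              # fastText vectors, as in the README

    sentences = ['this is a sentence', 'this is also a sentence']
    model.build_vocab(sentences, tokenize=True)
    emb = model.encode(sentences, tokenize=True)

    sim = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
    print(sim)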

+2
Sep 14 '18 at 3:12

Gensim implements a model called Doc2Vec for paragraph embedding.

There are various tutorials presented as IPython notebooks:

Another method relies on Word2Vec and Word Mover's Distance (WMD), as shown in this tutorial:

An alternative solution would be to rely on average vectors:

    from gensim.models import KeyedVectors
    from gensim.utils import simple_preprocess

    def tidy_sentence(sentence, vocabulary):
        return [word for word in simple_preprocess(sentence) if word in vocabulary]

    def compute_sentence_similarity(sentence_1, sentence_2, model_wv):
        vocabulary = set(model_wv.index2word)
        tokens_1 = tidy_sentence(sentence_1, vocabulary)
        tokens_2 = tidy_sentence(sentence_2, vocabulary)
        return model_wv.n_similarity(tokens_1, tokens_2)

    wv = KeyedVectors.load('model.wv', mmap='r')
    sim = compute_sentence_similarity('this is a sentence', 'this is also sentence', wv)
    print(sim)

Finally, if you can run Tensorflow, you can try: https://tfhub.dev/google/universal-sentence-encoder/2
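A minimal sketch of calling that module with the TF1-style hub API (assuming TensorFlow 1.x and tensorflow_hub are installed; the module URL is the one above):

    import numpy as np
    import tensorflow as tf
    import tensorflow_hub as hub

    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
    sentences = ["this is a sentence", "this is also a sentence"]

    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        vectors = session.run(embed(sentences))

    # cosine similarity between the two sentence vectors
    sim = np.inner(vectors[0], vectors[1]) / (np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
    print(sim)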

0
Jan 23 '19 at 15:29
