What meaning does the length of a Word2vec vector have?

I'm using Word2vec through gensim with Google's pretrained vectors trained on Google News. I've noticed that the word vectors I can access by doing direct index lookups on the Word2Vec object are not unit vectors:

    >>> import numpy
    >>> from gensim.models import Word2Vec
    >>> w2v = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    >>> king_vector = w2v['king']
    >>> numpy.linalg.norm(king_vector)
    2.9022589

However, these non-unit vectors are not what most_similar uses; instead, it uses normalized versions from the undocumented .syn0norm property, which contains only unit vectors:

    >>> w2v.init_sims()
    >>> unit_king_vector = w2v.syn0norm[w2v.vocab['king'].index]
    >>> numpy.linalg.norm(unit_king_vector)
    0.99999994

The larger vector is just a scaled-up version of the unit vector:

    >>> king_vector - numpy.linalg.norm(king_vector) * unit_king_vector
    array([  0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
             0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
            -7.45058060e-09,   0.00000000e+00,   3.72529030e-09,
             0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
             0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
             0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
             0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
             ... (some lines omitted) ...
            -1.86264515e-09,  -3.72529030e-09,   0.00000000e+00,
             0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
             0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
             0.00000000e+00,   0.00000000e+00,   0.00000000e+00], dtype=float32)

Given that word similarity comparisons in Word2Vec are done by cosine similarity, it's not obvious to me what the lengths of the non-normalized vectors mean - although I assume they mean something, since gensim exposes them to me rather than only exposing the unit vectors in .syn0norm.

How are the lengths of these non-normalized Word2vec vectors generated, and what is their meaning? For what calculations does it make sense to use the normalized vectors, and when should I use the non-normalized ones?

1 answer


The objective function of word-embedding models is to maximize the log-likelihood of the data under the model. In word2vec this is achieved by maximizing the dot product (normalized with a softmax) between the predicted vector (built from the context) and the actual vector (the current representation) of the word, given its context.
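
To make that concrete, here is a minimal numpy sketch with made-up toy matrices (not gensim's internals): the probability of an output word is the softmax-normalized dot product between the context's predicted vector and that word's output vector.

    # Toy illustration of the softmax-normalized dot product used as the
    # prediction probability; matrices below are random placeholders.
    import numpy

    vocab_size, dim = 5, 3
    rng = numpy.random.RandomState(0)
    U = rng.randn(vocab_size, dim)     # output-word vectors, one row per word (toy values)
    v_c = rng.randn(dim)               # predicted vector built from the context (toy values)

    scores = U.dot(v_c)                            # raw dot products with every word
    probs = numpy.exp(scores) / numpy.exp(scores).sum()   # softmax normalization

    # Training maximizes log(probs[o]) for the word o actually observed in this
    # context, i.e. the log-likelihood of the data.
    print(probs, probs.sum())          # probabilities sum to 1.0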

Note that the task the word vectors are trained on is either predicting the context given a word, or predicting a word given its context (skip-gram vs. CBOW). The lengths of the word vectors do not matter for that objective as such, but the trained vectors themselves turn out to have interesting properties/applications.

To find similar words, you look for the words with maximum cosine similarity to your query (which is equivalent to finding the words with minimum Euclidean distance once the vectors are unit-normalized), and that is exactly what most_similar does.
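
As a sketch of that equivalence (assuming the `w2v` GoogleNews model loaded in the question above), cosine similarity can be computed by hand from unit-normalized vectors, and for unit vectors it fully determines the Euclidean distance:

    # Sketch, assuming `w2v` is the model loaded in the question above.
    import numpy

    def unit(v):
        return v / numpy.linalg.norm(v)

    a, b = unit(w2v['king']), unit(w2v['queen'])

    cos_sim = numpy.dot(a, b)              # cosine similarity
    euc_dist = numpy.linalg.norm(a - b)    # Euclidean distance between the unit vectors

    # For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so ranking words by maximum
    # cosine similarity gives the same order as ranking by minimum Euclidean distance.
    print(cos_sim, 1 - euc_dist**2 / 2)    # the two values agree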

To find analogies, we can simply use the difference (or direction) vectors between the raw word-vector representations. For instance:

  • v('Paris') - v('France') ~ v('Rome') - v('Italy')
  • v('good') - v('bad') ~ v('happy') - v('sad')

In gensim:

    >>> import gensim
    >>> model = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    >>> model.most_similar(positive=['good', 'sad'], negative=['bad'])
    [(u'wonderful', 0.6414928436279297),
     (u'happy', 0.6154338121414185),
     (u'great', 0.5803680419921875),
     (u'nice', 0.5683973431587219),
     (u'saddening', 0.5588893294334412),
     (u'bittersweet', 0.5544661283493042),
     (u'glad', 0.5512036681175232),
     (u'fantastic', 0.5471092462539673),
     (u'proud', 0.530515193939209),
     (u'saddened', 0.5293528437614441)]
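
To connect this back to the normalized vectors from the question, here is a rough sketch (assuming the old-style gensim API used above, with init_sims() already called so .syn0norm is populated) that reproduces the same analogy query by hand; since the stored vectors are unit-normalized, cosine similarity reduces to a plain dot product.

    # Rough sketch, assuming the old-style gensim API used above and that
    # model.init_sims() has been called so .syn0norm is populated.
    import numpy

    query = (model.syn0norm[model.vocab['good'].index]
             + model.syn0norm[model.vocab['sad'].index]
             - model.syn0norm[model.vocab['bad'].index])
    query = query / numpy.linalg.norm(query)    # unit-normalize the combined query

    sims = model.syn0norm.dot(query)            # cosine similarity with every vocabulary word
    best = numpy.argsort(-sims)[:10]
    print([model.index2word[i] for i in best])  # unlike most_similar, this list still
                                                # contains the query words themselves

This is also why most_similar works on the unit vectors in .syn0norm: for similarity and analogy lookups only the directions of the vectors matter, not their lengths.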

References:

  • GloVe: Global Vectors for Word Representation (paper)
  • word2vec Parameter Learning Explained (paper)
  • Linguistic Regularities in Continuous Space Word Representations (paper)
  • word2vec
