I'm using Word2Vec from gensim with the pre-trained vectors from the Google News dataset. I've noticed that the word vectors I can access by indexing the Word2Vec object directly are not unit vectors:
>>> import numpy
>>> from gensim.models import Word2Vec
>>> w2v = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
>>> king_vector = w2v['king']
>>> numpy.linalg.norm(king_vector)
2.9022589
However, most_similar doesn't use these non-unit vectors; instead, it uses normalized versions from the undocumented .syn0norm property, which contains only unit vectors:
>>> w2v.init_sims()
>>> unit_king_vector = w2v.syn0norm[w2v.vocab['king'].index]
>>> numpy.linalg.norm(unit_king_vector)
0.99999994
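As far as I can tell, most_similar is then just a dot product against .syn0norm, since the cosine of two unit vectors is their dot product. Here's a minimal sketch of my understanding, using the objects from the session above (I haven't verified this against the gensim source, so treat it as my reading of the behaviour rather than the actual implementation):

>>> # cosine(u, v) = dot(u, v) / (|u| * |v|); for unit vectors the denominator
>>> # is 1, so one matrix-vector product scores every word in the vocabulary
>>> sims = numpy.dot(w2v.syn0norm, unit_king_vector)
>>> top = numpy.argsort(sims)[::-1][1:4]  # skip the first hit, which is 'king' itself
>>> [(w2v.index2word[i], sims[i]) for i in top]  # should match w2v.most_similar('king', topn=3)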
The larger vector is just a scaled-up version of the unit vector:
>>> king_vector - numpy.linalg.norm(king_vector) * unit_king_vector
array([  0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
         0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
        -7.45058060e-09,   0.00000000e+00,   3.72529030e-09,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        ... (some lines omitted) ...
        -1.86264515e-09,  -3.72529030e-09,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00], dtype=float32)
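(Equivalently, numpy.allclose confirms the difference is just float32 rounding:)

>>> numpy.allclose(king_vector, numpy.linalg.norm(king_vector) * unit_king_vector)
True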
Given that word similarity comparisons in Word2Vec are done by cosine similarity, it's not clear to me what the lengths of the non-normalized vectors mean, although I assume they must mean something, since gensim exposes them to me rather than only exposing the unit vectors in .syn0norm.
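To check my own reasoning here: scaling a vector doesn't change its cosine with anything, so the lengths can't be affecting most_similar's rankings ('queen' below is just an arbitrary second word for the demonstration):

>>> def cosine(u, v):
...     return numpy.dot(u, v) / (numpy.linalg.norm(u) * numpy.linalg.norm(v))
...
>>> queen_vector = w2v['queen']
>>> numpy.isclose(cosine(king_vector, queen_vector),
...               cosine(10 * king_vector, queen_vector))
True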
How are the lengths of these unnormalized Word2vec vectors generated, and what do they mean? For which calculations does it make sense to use the normalized vectors, and for which should I use the unnormalized ones?