How to access the output embeddings (output vectors) in gensim word2vec?

I want to use the output embeddings of word2vec, for example as in this paper (Improving Document Ranking with Dual Word Embeddings).

I know that the input vectors are in syn0, and that the output vectors are in syn1 (with hierarchical softmax) or syn1neg (with negative sampling).

But when I ran most_similar with an output vector, I got nearly identical results for different row indices of syn1 or syn1neg.

Here is what I got.

    IN[1]: model = Word2Vec.load('test_model.model')
    IN[2]: model.most_similar([model.syn1neg[0]])
    OUT[2]: [('of', -0.04402521997690201),
     ('has', -0.16387106478214264),
     ('in', -0.16650712490081787),
     ('is', -0.18117375671863556),
     ('by', -0.2527652978897095),
     ('was', -0.254993200302124),
     ('from', -0.2659570872783661),
     ('the', -0.26878535747528076),
     ('on', -0.27521973848342896),
     ('his', -0.2930959463119507)]

but the syn1neg vector at a different index already yields almost the same neighbors:

    IN[3]: model.most_similar([model.syn1neg[50]])
    OUT[3]: [('of', -0.07884830236434937),
     ('has', -0.16942456364631653),
     ('the', -0.1771494299173355),
     ('his', -0.2043554037809372),
     ('is', -0.23265135288238525),
     ('in', -0.24725285172462463),
     ('by', -0.27772971987724304),
     ('was', -0.2979024648666382),
     ('time', -0.3547973036766052),
     ('he', -0.36455872654914856)]

I want to get at the output numpy arrays (negative-sampling or not) exactly as they were stored during training.

Please let me know how I can access the raw syn1 or syn1neg arrays, or what code or word2vec module can return the output embeddings.

python numpy gensim word2vec
1 answer

With negative sampling, the syn1neg weights are per-word, and in the same order as syn0.
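To make that parallel indexing concrete, here is a minimal numpy-only sketch with made-up toy data. The names index2word, vocab, syn0, and syn1neg only mirror gensim's attribute names; nothing here touches gensim itself, and the matrices are random stand-ins for a trained model's arrays:

```python
import numpy as np

# Toy stand-ins (hypothetical data): row i of each matrix
# belongs to the same word, so one vocab index addresses both.
index2word = ['the', 'of', 'cousin']           # frequency-sorted vocab
vocab = {w: i for i, w in enumerate(index2word)}

rng = np.random.default_rng(0)
syn0 = rng.normal(size=(3, 4))      # IN (input) embeddings
syn1neg = rng.normal(size=(3, 4))   # OUT (output) embeddings, same row order

i = vocab['cousin']
in_vec = syn0[i]        # the word's IN vector
out_vec = syn1neg[i]    # the word's OUT vector, read from the same row
```

The point is only that no separate lookup table is needed for the OUT vectors: whatever index a word has in the vocabulary works for both matrices.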

The fact that your two examples give similar results does not necessarily indicate that anything is wrong. Words are sorted by frequency by default, so the early words (including those at positions 0 and 50) are very frequent words with very broad co-occurrence patterns, which may well land close to one another.

Pick a mid-frequency word with a sharper meaning and you may get more meaningful results (if your corpus/settings/needs are a good match for those of your "dual word embeddings" paper). For example, you might compare:

 model.most_similar('cousin') 

...with...

 model.most_similar(positive=[model.syn1neg[model.vocab['cousin'].index]]) 

However, in all cases the existing most_similar() method only searches for similar vectors among syn0 – the IN vectors, in the paper's terminology. So I believe the code above would only really be calculating what the paper might call "OUT-IN" similarity: a ranked list of which IN vectors are most similar to a given OUT vector. They actually seem to tout the reverse, "IN-OUT" similarity, as the useful one: the OUT vectors most similar to a given IN vector.
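The two directions can also be computed by hand with plain numpy, which makes the distinction explicit. A sketch on random stand-in matrices rather than a trained model (syn0/syn1neg here are just placeholders): after L2-normalizing rows so dot products become cosine similarities, "OUT-IN" ranks IN rows against one OUT vector – roughly what most_similar([model.syn1neg[i]]) does – while "IN-OUT" ranks OUT rows against one IN vector.

```python
import numpy as np

rng = np.random.default_rng(42)
n_words, dim = 100, 20
syn0 = rng.normal(size=(n_words, dim))      # stand-in IN vectors
syn1neg = rng.normal(size=(n_words, dim))   # stand-in OUT vectors

def unit_rows(m):
    """L2-normalize each row so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

in_n, out_n = unit_rows(syn0), unit_rows(syn1neg)

query = 5  # vocabulary index of some query word

# "IN-OUT": which OUT vectors are most similar to this word's IN vector
in_out = out_n @ in_n[query]
top_in_out = np.argsort(-in_out)[:10]

# "OUT-IN": which IN vectors are most similar to this word's OUT vector
out_in = in_n @ out_n[query]
top_out_in = np.argsort(-out_in)[:10]
```

Note that because IN and OUT vectors live in different (though jointly trained) matrices, the query word itself is not guaranteed to top either list, unlike ordinary syn0-to-syn0 similarity.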

In recent gensim versions, the KeyedVectors class is used to represent a set of word-vectors keyed by string, separately from the specific Word2Vec model (or other training method) that produced them. You could create an extra KeyedVectors instance that replaces the usual syn0 with syn1neg, to get ranked lists of OUT vectors similar to a given target vector (and thus calculate top-n "IN-OUT" or even "OUT-OUT" similarities).

For example, this might work (I have not tested it):

    outv = KeyedVectors()
    outv.vocab = model.wv.vocab            # same
    outv.index2word = model.wv.index2word  # same
    outv.syn0 = model.syn1neg              # different
    inout_similars = outv.most_similar(positive=[model['cousin']])

syn1 only exists when using hierarchical softmax, and there it is less clear what an "output embedding" for an individual word would even be. (Multiple output nodes are involved in predicting any single word, and all of them need to move closer to their proper 0/1 values for that word to be predicted. So unlike with syn1neg, there is no single place to read off an output vector per word; you would have to calculate or approximate the set of hidden-to-output weights that would drive those multiple output nodes toward the right values.)

