How to import word2vec into TensorFlow Seq2Seq model?

I am playing with the TensorFlow sequence-to-sequence translation model. I was wondering if I can import my own pre-trained word2vec vectors into this model, instead of using the original dense representation mentioned in the tutorial.

From what I can see, TensorFlow seems to use a one-hot style representation for the seq2seq model. The tf.nn.seq2seq.embedding_attention_seq2seq function takes symbol IDs as input, e.g. "a" is 4 and "dog" is 15715, and it requires num_encoder_symbols. So I think it expects me to provide the position of each word and the total number of words, and the function then represents each word internally via that one-hot/ID scheme. I am still studying the source code, but it is hard to understand.
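For reference, here is roughly how I understand the call is made (just a sketch against the TensorFlow 0.x-era API; the vocabulary size, cell, and sequence length are made-up placeholders, and the exact signature may differ between versions):

  import tensorflow as tf

  batch_size, seq_len = 32, 10
  vocab_size = 40000      # passed as num_encoder_symbols / num_decoder_symbols
  embedding_size = 128

  # One int32 tensor of shape [batch_size] per time step, holding word IDs
  # such as 4 for "a" or 15715 for "dog" -- not one-hot vectors.
  encoder_inputs = [tf.placeholder(tf.int32, [batch_size])
                    for _ in range(seq_len)]
  decoder_inputs = [tf.placeholder(tf.int32, [batch_size])
                    for _ in range(seq_len)]

  cell = tf.nn.rnn_cell.GRUCell(256)
  outputs, state = tf.nn.seq2seq.embedding_attention_seq2seq(
      encoder_inputs, decoder_inputs, cell,
      num_encoder_symbols=vocab_size,
      num_decoder_symbols=vocab_size,
      embedding_size=embedding_size)

As far as I can tell, the embedding lookup is built internally from those integer IDs, which is why the function only needs the symbol counts.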

Can someone give me some ideas on the problem above?

2 answers

The seq2seq embedding_* functions do create embedding matrices very similar to those from word2vec. They are stored in a variable named something like this:

EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
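If the exact path is different in your version of TensorFlow, you can list the trainable variables of your graph and look for the embedding entry:

  # Print every trainable variable; the encoder embedding appears under a
  # name similar to EMBEDDING_KEY above.
  for v in tf.trainable_variables():
    print(v.name, v.get_shape())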

Knowing this, you can simply change this variable. I mean: get your word2vec vectors in some format, say a text file. Assuming you have your vocabulary in model.vocab, you can then assign the vectors you read in the way illustrated by the snippet below (it is just a snippet, you will have to change it to make it work, but I hope it shows the idea).

  import sys
  import numpy as np
  import tensorflow as tf
  from tensorflow.python.platform import gfile

  # Find the encoder embedding variable in the current graph.
  vectors_variable = [v for v in tf.trainable_variables()
                      if EMBEDDING_KEY in v.name]
  if len(vectors_variable) != 1:
    print("Word vector variable not found or too many.")
    sys.exit(1)
  vectors_variable = vectors_variable[0]
  vectors = vectors_variable.eval()

  print("Setting word vectors from %s" % FLAGS.word_vector_file)
  with gfile.GFile(FLAGS.word_vector_file, mode="r") as f:
    # Lines have format: dog 0.045123 -0.61323 0.413667 ...
    for line in f:
      line_parts = line.split()
      # The first part is the word.
      word = line_parts[0]
      if word in model.vocab:
        # Remaining parts are components of the vector.
        word_vector = np.array([float(x) for x in line_parts[1:]])
        if len(word_vector) != vec_size:  # vec_size: embedding dimension.
          print("Warn: Word '%s', Expecting vector size %d, found %d"
                % (word, vec_size, len(word_vector)))
        else:
          vectors[model.vocab[word]] = word_vector

  # Assign the modified vectors to vectors_variable in the graph by
  # re-running its initializer and feeding the new values as its input.
  session.run([vectors_variable.initializer],
              {vectors_variable.initializer.inputs[1]: vectors})

I assume that, using the variable-scope style Matthew talked about, you can get the variable:

  with tf.variable_scope("embedding_attention_seq2seq"):
    with tf.variable_scope("RNN"):
      with tf.variable_scope("EmbeddingWrapper", reuse=True):
        embedding = vs.get_variable("embedding", [shape], [trainable=])

In addition, I would expect that you also want to set the embeddings in the decoder; the key (or scope) for it would be something like:

"embedding_attention_seq2seq / embedding_attention_decoder / attachment"


Thanks for your reply, Lukash!

I was wondering what exactly model.vocab[word] in the code snippet means? Is it just the position of the word in the vocabulary?

In that case, wouldn't it be faster to iterate through the vocabulary and insert the w2v vectors only for the words that exist in the w2v model?
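Something like this is what I have in mind (assuming model.vocab maps word to row index and w2v is a dict-like word-to-vector lookup, e.g. a loaded gensim model):

  # Walk the model vocabulary once and copy vectors only for known words.
  for word, idx in model.vocab.items():
    if word in w2v:
      vectors[idx] = w2v[word]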
