Gensim word2vec with a predefined dictionary and word-index data

I need to train a word2vec representation on tweets using gensim. Unlike most tutorials and code I have seen for gensim, my data is not raw but has already been pre-processed. I have a dictionary in a text file containing 65 thousand words (including an "unknown" token and an EOL token), and the tweets are saved as a numpy matrix of indexes into this dictionary. Here is a simple example of the data format:

dict.txt

you love this code 

tweets (5 = unknown token, 6 = EOL token)

 [[0, 1, 2, 3, 6], [3, 5, 5, 1, 6], [0, 1, 3, 6, 6]] 

I am not sure how I should handle the representation of the indexes. A simple way is to convert each list of indexes to a list of strings (i.e. [0, 1, 2, 3, 6] → ['0', '1', '2', '3', '6']) before feeding it to the word2vec model. However, this seems inefficient, since gensim will then have to map each string token, e.g. '2', back to an internal index.
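To make the naive approach concrete, here is a minimal sketch using the example matrix above (plain lists here, but the same comprehension works row by row on a numpy array):

```python
# Each row is one tweet; 5 = unknown token, 6 = EOL token.
tweets = [[0, 1, 2, 3, 6], [3, 5, 5, 1, 6], [0, 1, 3, 6, 6]]

# Naive conversion: every integer index becomes its string form.
sentences = [[str(idx) for idx in row] for row in tweets]
print(sentences[0])  # ['0', '1', '2', '3', '6']
```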

How can I load this data and train a word2vec representation efficiently using gensim?

+7
python nlp gensim word2vec
2 answers

The usual way to initialize a Word2Vec model in gensim is [1]

 model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4) 

The question is: what are sentences ? sentences should be an iterable of iterables of words / tokens. This is similar to your numpy matrix, except that each row may have a different length.
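As a made-up illustration (these tokens are not from the question), any iterable of token lists qualifies:

```python
# A plain list of token lists: rows may have different lengths.
sentences = [
    ["you", "love", "this"],
    ["this", "code"],
]

# A generator over rows works just as well and avoids holding
# everything in memory at once.
def iter_sentences(rows):
    for row in rows:
        yield [str(token) for token in row]
```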

If you look at the documentation for gensim.models.word2vec.LineSentence , it lets you load a text file directly as sentences. As a hint, the documentation says it requires:

one sentence = one line; words already pre-processed and separated by spaces.

Where it says words already preprocessed , it is referring to lowercasing, stemming, stopword removal and all the other text-cleaning steps. In your case, you do not want 5 and 6 in your sentence lists, so you need to filter them out.

Given that you already have a numpy matrix, and assuming each row is a sentence, it is better to convert it to a 2d list and filter out all the 5 s and 6 s. The resulting 2d list can be used directly as the sentences argument to initialize the model. The only catch is that when you want to query the model after training, you need to look up indexes instead of tokens.
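A minimal sketch of that conversion, assuming 5 = unknown and 6 = EOL as in the question (plain lists here; the same comprehension works row by row on the numpy matrix):

```python
tweets = [[0, 1, 2, 3, 6], [3, 5, 5, 1, 6], [0, 1, 3, 6, 6]]

# Drop the unknown (5) and EOL (6) indexes and stringify the rest,
# producing the iterable of token lists that Word2Vec expects.
sentences = [[str(idx) for idx in row if idx not in (5, 6)]
             for row in tweets]
print(sentences)  # [['0', '1', '2', '3'], ['3', '1'], ['0', '1', '3']]

# Training would then look like (min_count=1 so no index is discarded):
# from gensim.models import Word2Vec
# model = Word2Vec(sentences, min_count=1)
# and you query the model with string indexes afterwards, e.g. '2'.
```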

Now one question remains: can the model take integers directly? The pure-Python version does not check the type and simply hashes the unique tokens, so your unique indexes would work fine. But most of the time you will want the C-extension routine to train your model, which matters a lot because it can give a 70x speedup. [2] I assume that in this case the C code may check for string types, which would mean the string-to-index mapping has to be kept.

Is it really inefficient? I think not, because the strings you have are numbers, which are generally much shorter than the real tokens they represent (assuming they are compact indexes starting from 0 ). The models will therefore be smaller, which saves some effort in serializing and deserializing the model at the end. You have, in effect, encoded the input tokens in a shorter string format and separated that encoding from the Word2Vec training, and the Word2Vec model does not and should not care that this encoding happened before training.

My philosophy is to try the simplest way first . I would just throw a sample of the integer input at the model and see what goes wrong. Hope this helps.

[1] https://radimrehurek.com/gensim/models/word2vec.html

[2] http://rare-technologies.com/word2vec-in-python-part-two-optimizing/

+7

I had the same problem. Even converting to an array of strings through

 >>> arr_str = np.char.mod('%d', arr) 

threw an exception when starting Word2Vec:

 >>> model = Word2Vec(arr_str)
 ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

My solution was to write the integer array to a text file and then use Word2Vec with LineSentence .

 import numpy as np
 from gensim.models import Word2Vec
 from gensim.models.word2vec import LineSentence

 np.savetxt('train_data.txt', arr, delimiter=" ", fmt="%s")
 sentences = LineSentence('train_data.txt')
 model = Word2Vec(sentences)
+1
