The usual way to initialize a Word2Vec model in gensim is [1]
from gensim.models import Word2Vec

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
The question is: what is sentences ? sentences should be an iterable of iterables of words / tokens. This is similar to a 2D numpy matrix, except that each row may have a different length.
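For example, a plain list of lists of strings satisfies this interface (a toy sketch; the words here are made up):

sentences = [
    ["the", "quick", "brown", "fox"],  # row of length 4
    ["hello", "world"],                # row of length 2 -- lengths may differ
]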
If you look at the documentation for gensim.models.word2vec.LineSentence , it lets you feed text files in directly as sentences. As a hint, the format it requires, according to the documentation, is
one sentence = one line; words already pre-processed and separated by spaces.
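As a sketch of that usage, assuming a file named corpus.txt in exactly that format (the file name is hypothetical):

from gensim.models.word2vec import Word2Vec, LineSentence

sentences = LineSentence('corpus.txt')  # streams one pre-tokenized sentence per line
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)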
When the documentation says words already pre-processed , it means lowercasing, stemming, stopword filtering, and all the other text-cleaning steps. In your case, you do not want 5 and 6 to appear in your sentences, so you need to filter them out.
Given that you already have a numpy matrix, and assuming each row is a sentence, it is better to convert it into a 2D list and filter out all the 5 s and 6 s. The resulting 2D list can be used directly as the sentences argument to initialize the model. The only catch is that when you want to query the model after training, you have to look up by index instead of by the original token.
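A minimal sketch of that conversion, assuming your matrix is called mat and that 5 and 6 are the values to drop (both names and values are placeholders):

import numpy as np

mat = np.array([[1, 5, 2, 6],
                [3, 6, 4, 5]])  # hypothetical input matrix
sentences = [[tok for tok in row if tok not in (5, 6)] for row in mat.tolist()]
# sentences is now [[1, 2], [3, 4]]: a plain 2D list whose rows may differ in length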
Now the question is whether the model will take integers directly. In the pure-Python code path it does not check types; it simply passes the unique tokens along, so your unique integers would work fine. But most of the time you will want the C-extension routines to train your model, and that matters a lot, since they can give up to a 70x speedup. [2] I would guess that in this case the C code checks for the string type, which means you would have to pass the indexes as strings and keep the string-to-index mapping around.
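To stay on the safe side for the C path, you can stringify each index before training (a sketch continuing the hypothetical sentences above; min_count is lowered to 1 only so the toy vocabulary is not pruned away):

from gensim.models import Word2Vec

str_sentences = [[str(tok) for tok in row] for row in sentences]
model = Word2Vec(str_sentences, size=100, window=5, min_count=1, workers=4)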
Is this inefficient? I think not, because the strings you have are numbers, which are generally much shorter than the real tokens they represent (assuming they are compact indexes starting from 0 ). The model will therefore be smaller, which saves some effort when serializing and deserializing it at the end. You have, in effect, encoded the input tokens in a shorter string format, separately from the Word2Vec training, and the Word2Vec model does not and should not know that this encoding happened before training.
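Querying then just means stringifying the index first. Assuming the model trained in the sketch above (and a gensim version where model.wv is available):

vector = model.wv[str(2)]                # embedding for the token whose original index is 2
similar = model.wv.most_similar(str(2))  # nearest neighbours, also keyed by stringified index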
My philosophy is to try the simplest way first . I would just throw a small test input of integers at the model and see what goes wrong. Hope this helps.
[1] https://radimrehurek.com/gensim/models/word2vec.html
[2] http://rare-technologies.com/word2vec-in-python-part-two-optimizing/