Is it possible to retrain a word2vec model (for example, GoogleNews-vectors-negative300.bin) on a new set of sentences in Python?

I am using the pre-trained Google News dataset to get word vectors, via the Gensim library in Python:

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True) 

After loading the model, I convert the words of my training sentences into vectors:

    # reading all sentences from training file
    with open('restaurantSentences', 'r') as infile:
        x_train = infile.readlines()

    # cleaning sentences
    x_train = [review_to_wordlist(review, remove_stopwords=True) for review in x_train]
    train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])

During this word2vec step I get many errors for words in my corpus that are not in the model's vocabulary. The question is: how can I retrain an already pre-trained model (for example, GoogleNews-vectors-negative300.bin) so that I also get vectors for these missing words?
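The `buildWordVector` helper used above is not shown in the question. A minimal sketch of what such a helper typically does, averaging the vectors of in-vocabulary words and silently skipping out-of-vocabulary ones instead of erroring, using a toy embedding dict in place of the real 300-dimensional model (the function name and toy vectors are assumptions, not the asker's actual code):

```python
import numpy as np

# Toy stand-in for a trained model's word -> vector lookup.
# A real Word2Vec model would supply 300-dimensional vectors instead.
n_dim = 4
embeddings = {
    "food": np.array([1.0, 0.0, 0.0, 0.0]),
    "rice": np.array([0.0, 1.0, 0.0, 0.0]),
}

def build_word_vector(words, n_dim, embeddings):
    """Average the vectors of in-vocabulary words, skipping
    out-of-vocabulary words instead of raising an error."""
    vec = np.zeros(n_dim)
    count = 0
    for word in words:
        if word in embeddings:  # skip words missing from the model
            vec += embeddings[word]
            count += 1
    if count > 0:
        vec /= count
    return vec

sentence_vec = build_word_vector(["food", "rice", "unknownword"], n_dim, embeddings)
print(sentence_vec)  # the OOV word is ignored, the rest are averaged
```

Skipping OOV words this way avoids the KeyError-style failures described above, but the missing words still contribute nothing to the sentence vector, which is why the asker wants to extend the model's vocabulary.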

Here is what I tried. First, I trained a new model from the training sentences that I had:

    # Set values for various parameters
    num_features = 300    # Word vector dimensionality
    min_word_count = 10   # Minimum word count
    num_workers = 4       # Number of threads to run in parallel
    context = 10          # Context window size
    downsampling = 1e-3   # Downsample setting for frequent words

    sentences = gensim.models.word2vec.LineSentence("restaurantSentences")

    # Initialize and train the model (this will take some time)
    print("Training model...")
    model = gensim.models.Word2Vec(sentences, workers=num_workers,
                                   size=num_features, min_count=min_word_count,
                                   window=context, sample=downsampling)
    # Passing `sentences` to the constructor already builds the vocabulary
    # and trains the model, so separate build_vocab()/train() calls are redundant.

    model.n_similarity(["food"], ["rice"])

It worked! But the problem is that my dataset is really small, and I lack the resources to train a large model.

The second approach I am considering is to extend an already trained model, such as GoogleNews-vectors-negative300.bin:

    model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
    model.train(sentences)

Is this possible, and is it a good approach? Please help me out.

python nlp gensim word2vec
3 answers

Some people have been working on extending gensim to allow online training.

There are a couple of GitHub pull requests you might want to watch for progress on this effort:

It looks like this improvement would allow you to update the GoogleNews-vectors-negative300.bin model.


This is how I technically solved the problem:

Preparing the data input with a sentence iterator, following Radim Řehůřek's tutorial: https://rare-technologies.com/word2vec-tutorial/

 sentences = MySentences('newcorpus') 

Setting up the model:

 model = gensim.models.Word2Vec(sentences) 

Intersecting the vocabulary with the Google word vectors:

 model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', lockf=1.0, binary=True) 

Finally, training and updating the model:

 model.train(sentences) 
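The `lockf=1.0` argument in the intersect step above controls whether the imported vectors may still move during subsequent training: 0.0 freezes them, 1.0 leaves them trainable. A conceptual numpy sketch of that lock-factor idea (not gensim's actual internals; the arrays and learning rate are made up for illustration):

```python
import numpy as np

# Two toy word vectors; each row has a "lock factor" that scales
# any gradient update applied to it.
vectors = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
lockf = np.array([0.0, 1.0])  # row 0 frozen, row 1 trainable

gradient = np.array([[0.5, 0.5],
                     [0.5, 0.5]])
learning_rate = 0.1

# Update step: frozen rows (lockf == 0.0) receive no change,
# trainable rows (lockf == 1.0) get the full update.
vectors += learning_rate * lockf[:, np.newaxis] * gradient

print(vectors)  # row 0 unchanged, row 1 shifted by 0.05 per dimension
```

With `lockf=1.0`, as in the answer above, the imported Google vectors can be fine-tuned by the small corpus; with `lockf=0.0` they would stay fixed and only newly added words would train.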

Warning note: from a substantive point of view it is, of course, highly debatable whether a corpus that is likely very small can actually "improve" the Google word vectors, which were trained on a massive corpus...


Perhaps, if the model builder did not finalize the model training. In Python this is:

    model.init_sims(replace=True)  # finalize the model

If the model has not been finalized, this is an ideal way to continue training it with a large dataset.
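What finalization does conceptually: `init_sims(replace=True)` precomputes the L2-normalized vectors and overwrites the raw ones to save memory, discarding the magnitudes that training needs, so no further training is possible afterwards. A numpy sketch of that in-place normalization (hypothetical toy vectors, not gensim internals):

```python
import numpy as np

# Raw (unnormalized) word vectors, as they exist during training.
vectors = np.array([[3.0, 4.0],
                    [0.0, 2.0]])

# What init_sims(replace=True) does conceptually: L2-normalize in place.
# The original magnitudes are overwritten, so further training on these
# vectors is no longer meaningful -- the model is "finalized".
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
vectors /= norms

print(vectors)  # every row now has unit length
```

This is why the answer above says continued training only works on a model that has *not* been finalized.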

