Is it possible to retrain a word2vec model (for example, GoogleNews-vectors-negative300.bin) on a new set of sentences in Python?

I am using the pre-trained Google News dataset to get word vectors, via the Gensim library in Python:

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True) 

After loading the model, I convert the words of my training sentences into vectors:

    # reading all sentences from training file
    with open('restaurantSentences', 'r') as infile:
        x_train = infile.readlines()

    # cleaning sentences
    x_train = [review_to_wordlist(review, remove_stopwords=True) for review in x_train]
    train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])

During this word2vec step I get many errors for words in my corpus that are not in the model's vocabulary. The question is: how can I retrain an already pre-trained model (for example, GoogleNews-vectors-negative300.bin) so that I also get vectors for these missing words?
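The `buildWordVector` helper used above is not shown in the question. A minimal sketch of what such a helper typically does, averaging the vectors of in-vocabulary words and silently skipping out-of-vocabulary ones instead of erroring, using a toy embedding dict in place of the real 300-dimensional model (the function name and toy vectors are assumptions, not the asker's actual code):

```python
import numpy as np

# Toy stand-in for a trained model's word -> vector lookup.
# A real Word2Vec model would supply 300-dimensional vectors instead.
n_dim = 4
embeddings = {
    "food": np.array([1.0, 0.0, 0.0, 0.0]),
    "rice": np.array([0.0, 1.0, 0.0, 0.0]),
}

def build_word_vector(words, n_dim, embeddings):
    """Average the vectors of in-vocabulary words, skipping
    out-of-vocabulary words instead of raising an error."""
    vec = np.zeros(n_dim)
    count = 0
    for word in words:
        if word in embeddings:  # skip words missing from the model
            vec += embeddings[word]
            count += 1
    if count > 0:
        vec /= count
    return vec

sentence_vec = build_word_vector(["food", "rice", "unknownword"], n_dim, embeddings)
print(sentence_vec)  # the OOV word is ignored, the rest are averaged
```

Skipping OOV words this way avoids the KeyError-style failures described above, but the missing words still contribute nothing to the sentence vector, which is why the asker wants to extend the model's vocabulary.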

Here is what I tried. First, I trained a new model from the training sentences that I had:

    # Set values for various parameters
    num_features = 300    # Word vector dimensionality
    min_word_count = 10   # Minimum word count
    num_workers = 4       # Number of threads to run in parallel
    context = 10          # Context window size
    downsampling = 1e-3   # Downsample setting for frequent words

    sentences = gensim.models.word2vec.LineSentence("restaurantSentences")

    # Initialize and train the model (this will take some time)
    print("Training model...")
    model = gensim.models.Word2Vec(sentences, workers=num_workers,
                                   size=num_features, min_count=min_word_count,
                                   window=context, sample=downsampling)
    # Passing `sentences` to the constructor already builds the vocabulary
    # and trains the model, so separate build_vocab()/train() calls are redundant.

    model.n_similarity(["food"], ["rice"])

It worked! But the problem is that my dataset is really small, and I lack the resources to train a large model.

The second approach I am considering is to extend an already trained model, such as GoogleNews-vectors-negative300.bin:

    model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
    model.train(sentences)

Is this possible, and is it a good approach? Please help me out.

python nlp gensim word2vec
3 answers

Some people have been working on extending gensim to allow online training.

There are a couple of GitHub pull requests you might want to watch for progress on this effort:

It looks like this improvement would allow you to update the GoogleNews-vectors-negative300.bin model.


This is how I technically solved the problem:

Preparing the data input with a sentence iterator, following Radim Řehůřek's tutorial: https://rare-technologies.com/word2vec-tutorial/

 sentences = MySentences('newcorpus') 

Setting up the model:

 model = gensim.models.Word2Vec(sentences) 

Intersecting the vocabulary with the Google word vectors:

 model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', lockf=1.0, binary=True) 

Finally, training and updating the model:

 model.train(sentences) 
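The `lockf=1.0` argument in the intersect step above controls whether the imported vectors may still move during subsequent training: 0.0 freezes them, 1.0 leaves them trainable. A conceptual numpy sketch of that lock-factor idea (not gensim's actual internals; the arrays and learning rate are made up for illustration):

```python
import numpy as np

# Two toy word vectors; each row has a "lock factor" that scales
# any gradient update applied to it.
vectors = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
lockf = np.array([0.0, 1.0])  # row 0 frozen, row 1 trainable

gradient = np.array([[0.5, 0.5],
                     [0.5, 0.5]])
learning_rate = 0.1

# Update step: frozen rows (lockf == 0.0) receive no change,
# trainable rows (lockf == 1.0) get the full update.
vectors += learning_rate * lockf[:, np.newaxis] * gradient

print(vectors)  # row 0 unchanged, row 1 shifted by 0.05 per dimension
```

With `lockf=1.0`, as in the answer above, the imported Google vectors can be fine-tuned by the small corpus; with `lockf=0.0` they would stay fixed and only newly added words would train.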

Warning note: from a substantive point of view it is, of course, highly debatable whether a corpus that is likely very small can actually "improve" the Google word vectors, which were trained on a massive corpus...


Perhaps, if the model builder did not finalize the model training. In Python this is:

    model.init_sims(replace=True)  # finalize the model

If the model has not been finalized, this is an ideal way to continue training it with a large dataset.
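What finalization does conceptually: `init_sims(replace=True)` precomputes the L2-normalized vectors and overwrites the raw ones to save memory, discarding the magnitudes that training needs, so no further training is possible afterwards. A numpy sketch of that in-place normalization (hypothetical toy vectors, not gensim internals):

```python
import numpy as np

# Raw (unnormalized) word vectors, as they exist during training.
vectors = np.array([[3.0, 4.0],
                    [0.0, 2.0]])

# What init_sims(replace=True) does conceptually: L2-normalize in place.
# The original magnitudes are overwritten, so further training on these
# vectors is no longer meaningful -- the model is "finalized".
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
vectors /= norms

print(vectors)  # every row now has unit length
```

This is why the answer above says continued training only works on a model that has *not* been finalized.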

