I am using the pre-trained Google News dataset to get word vectors, via the Gensim library in Python:
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
After loading the model, I convert the words of my training sentences into vectors:
# read all sentences from the training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()
During the word2vec lookup I get a lot of errors for words in my corpus that are not part of the model's vocabulary. The problem is: how can I retrain an already-trained model (for example, GoogleNews-vectors-negative300.bin) so that I get vectors for these missing words?
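To make the out-of-vocabulary problem concrete, here is a minimal sketch of the averaging step that raises the errors, with a small stand-in dict in place of the loaded GoogleNews model (the `vocab` dict and `sentence_vector` helper are my own illustration, not gensim API). Skipping unknown words avoids the lookup errors:

```python
import numpy as np

# Stand-in for the loaded model: maps word -> 300-dim vector.
# In practice this would be the object returned by load_word2vec_format.
vocab = {"food": np.ones(300), "rice": np.full(300, 0.5)}

def sentence_vector(words, model, dim=300):
    """Average the vectors of in-vocabulary words, skipping OOV words."""
    vecs = [model[w] for w in words if w in model]
    if not vecs:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    return np.mean(vecs, axis=0)

v = sentence_vector(["food", "rice", "unknownword"], vocab)
```

Skipping OOV words sidesteps the crash, but of course those words then contribute nothing, which is why I want their vectors in the model.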
Here's what I tried first: I trained a new model from scratch on the training sentences I had:
# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 10   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

sentences = gensim.models.word2vec.LineSentence("restaurantSentences")

# Initialize and train the model (this will take some time);
# passing sentences to the constructor already builds the vocabulary
# and trains, so no separate build_vocab()/train() calls are needed
print "Training model..."
model = gensim.models.Word2Vec(sentences, workers=num_workers,
                               size=num_features, min_count=min_word_count,
                               window=context, sample=downsampling)
model.n_similarity(["food"], ["rice"])
It worked! But the problem is that I have a really small dataset and too few resources to train a large model.
The second approach I am considering is to extend an already-trained model such as GoogleNews-vectors-negative300.bin:
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
model.train(sentences)
Is this possible, and is it a good approach? Please help me out.
python nlp gensim word2vec
Noman dilawar