CBOW vs skip-gram: why invert context and target words?

This page says that:

[...] skip-gram inverts contexts and targets, and tries to predict each context word from its target word [...]

However, looking at the training dataset it produces, the contents of the X and Y pairs seem to be interchangeable, since both of these (X, Y) pairs appear:

(quick, brown), (brown, quick)

So why distinguish between context and target at all, if in the end it is the same thing?

Also, doing the Udacity Deep Learning exercise on word2vec, I wonder why they seem to make such a distinction between the two approaches in this problem:

An alternative to skip-gram is another Word2Vec model called CBOW (Continuous Bag of Words). In the CBOW model, instead of predicting a context word from a word vector, you predict a word from the sum of all the word vectors in its context. Implement and evaluate a CBOW model trained on the text8 dataset.

Will this lead to the same results?
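For concreteness, here is a small sketch of how I currently read the two pair-generation schemes (my own toy code, not taken from the tutorial), with a window of 1. Note that both (quick, brown) and (brown, quick) show up as skip-gram pairs, while CBOW groups the whole context per position:

    # Toy sketch: how the same sentence is sliced into training examples
    # under skip-gram versus CBOW, with a context window of 1.
    sentence = "the quick brown fox jumped over the lazy dog".split()
    window = 1

    skipgram_pairs = []   # (input word, word to predict)
    cbow_examples = []    # (list of context words, word to predict)

    for i, target in enumerate(sentence):
        context = [sentence[j]
                   for j in range(max(0, i - window), min(len(sentence), i + window + 1))
                   if j != i]
        # Skip-gram: one example per (target, context word) pair.
        for c in context:
            skipgram_pairs.append((target, c))
        # CBOW: one example per position, with the whole context kept together.
        cbow_examples.append((context, target))

    print(skipgram_pairs[:4])
    # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
    print(cbow_examples[:2])
    # [(['quick'], 'the'), (['the', 'brown'], 'quick')]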

+24
deep-learning nlp tensorflow word2vec word-embedding
3 answers

Here is my simplified and rather naive understanding of the difference:

As we know, CBOW is learning to predict a word from its context, i.e. to maximize the probability of the target word given the context. And this is a problem for rare words. For example, given the context yesterday was a really [...] day, the CBOW model will tell you that the word is most likely beautiful or nice. Words such as delightful will get much less attention from the model, because it is designed to predict the most probable word. A rare word gets smoothed over many examples containing more frequent words.

On the other hand, the skip-gram model is designed to predict the context. Given the word delightful, it must understand it and tell us that there is a high probability that the context is yesterday was really [...] day, or some other relevant context. With skip-gram, the word delightful does not have to compete with the word beautiful; instead, delightful+context pairs are treated as new observations.

UPDATE

Thanks to @0xF for sharing this article.

According to Mikolov:

Skip-gram: works well with a small amount of training data, and represents even rare words or phrases well.

CBOW: several times faster to train than skip-gram, with slightly better accuracy for frequent words.

Another addition to the topic can be found here :

In the skip-gram mode, the alternative to CBOW, rather than averaging the context words, each of them is used as a pairwise training example. That is, in place of one CBOW example such as [predict 'ate' from average('The', 'cat', 'the', 'mouse')], the network is presented with four skip-gram examples: [predict 'ate' from 'The'], [predict 'ate' from 'cat'], [predict 'ate' from 'the'], [predict 'ate' from 'mouse']. (The same random window reduction occurs, so half the time that would just be two examples, of the nearest words.)
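To make the quoted example concrete, here is a toy sketch (my own, using made-up random vectors, not an actual word2vec implementation) of the one-CBOW-example-versus-four-skip-gram-examples split around 'ate':

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 8
    # Hypothetical, randomly initialized input vectors for the words in the window.
    vecs = {w: rng.normal(size=dim) for w in ["The", "cat", "ate", "the", "mouse"]}

    context = ["The", "cat", "the", "mouse"]

    # CBOW: one training example -- predict 'ate' from the averaged context vectors.
    cbow_input = np.mean([vecs[w] for w in context], axis=0)
    cbow_examples = [(cbow_input, "ate")]

    # Skip-gram: four training examples -- predict 'ate' from each context word alone.
    skipgram_examples = [(vecs[w], "ate") for w in context]

    print(len(cbow_examples), "CBOW example,", len(skipgram_examples), "skip-gram examples")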

+45

It has to do with what exactly you are computing at any given point. The difference becomes clearer if you start looking at models that incorporate a larger context into each probability computation.

In skip-gram, you compute the context word(s) from the word at the current position in the sentence; you "skip" the current word (and potentially a bit of the context) in your computation. The result can be more than one word (but not if your context window is just one word long).

In CBOW, you compute the current word from the context word(s), so you will only ever have one word as the result.
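In practice the two modes are usually just a flag in the same implementation. For example, a minimal sketch with gensim (assuming gensim 4.x, where vector_size is the dimensionality parameter and sg switches between CBOW and skip-gram):

    from gensim.models import Word2Vec

    corpus = [
        "the quick brown fox jumped over the lazy dog".split(),
        "yesterday was a really delightful day".split(),
    ]

    # sg=0 selects CBOW, sg=1 selects skip-gram; everything else is identical.
    cbow = Word2Vec(corpus, vector_size=32, window=2, min_count=1, sg=0, epochs=50, seed=1)
    skipgram = Word2Vec(corpus, vector_size=32, window=2, min_count=1, sg=1, epochs=50, seed=1)

    # Same data, same hyperparameters, but the training objective differs,
    # so the learned vectors for the same word are not the same.
    print(cbow.wv["delightful"][:3])
    print(skipgram.wv["delightful"][:3])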

0

Here is a link to the 2013 arXiv paper in which the Google engineers first described the CBOW and Skip-gram models:

Efficient Estimation of Word Representations in Vector Space

Check sections 3.1 and 3.2 for details.

0
