Here is my simplified and rather naive understanding of the difference:
As we know, CBOW learns to predict a word from its context, i.e. it maximizes the probability of the target word given the surrounding context. This is a problem for rare words. For example, given the context yesterday was a really [...] day, the CBOW model will tell you that the word is most likely beautiful or nice. Words such as delightful will get far less attention from the model, because it is designed to predict the most probable word; a rare word is smoothed over many examples containing more frequent words.
On the other hand, the skip-gram model is designed to predict the context from a word. Given the word delightful, it must understand it and tell us that there is a high probability that the context is yesterday was really [...] day, or some other relevant context. With skip-gram, the word delightful does not have to compete with the word beautiful; instead, delightful+context pairs are treated as new observations.
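As an illustration, the two modes are exposed in the gensim library through the sg flag of Word2Vec. This is a minimal sketch; the toy corpus and the parameter values are assumptions chosen just for demonstration:

```python
from gensim.models import Word2Vec

# Toy corpus; in practice you would train on a much larger tokenized corpus.
sentences = [
    ["yesterday", "was", "a", "really", "delightful", "day"],
    ["yesterday", "was", "a", "really", "beautiful", "day"],
    ["yesterday", "was", "a", "really", "nice", "day"],
]

# CBOW: predict the target word from its averaged context (sg=0, the default).
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Skip-gram: predict each context word from the target word (sg=1).
sg_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Both models expose the learned vectors the same way afterwards.
print(cbow_model.wv["delightful"][:5])
print(sg_model.wv["delightful"][:5])
```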
UPDATE
Thanks to @0xF for sharing this article.
According to Mikolov:
Skip-gram: works well with a small amount of training data and represents even rare words or phrases well.
CBOW: several times faster to train than skip-gram, with slightly better accuracy for frequent words.
Another addition to the topic can be found here:
In the skip-gram mode, alternative to CBOW, instead of averaging context words, each of them is used as an example of pairwise learning. Thus, instead of one CBOW example, such as [the predicate 'ate' from the average value ('The', 'cat', 'the', 'mouse')], the network is represented by four examples of skip grams [the predicate 'eat' from 'The'], [predict 'eat' from 'cat'], [predict 'eat' from 'the'], [predict 'eat' from 'mouse']. (The same random window reduction occurred, so half the time would be two examples of the closest words.)
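To make the difference concrete, here is a minimal sketch that turns one sentence into the two kinds of training examples described above. The helper names and the fixed window size are illustrative assumptions, not the actual word2vec implementation (which also applies random window reduction and negative sampling):

```python
sentence = ["The", "cat", "ate", "the", "mouse"]
window = 2  # number of context words taken on each side of the target

def cbow_examples(tokens, window):
    """One example per target word: (list of context words -> target word)."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples

def skipgram_examples(tokens, window):
    """One example per (target word, context word) pair."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for c in context:
            examples.append((target, c))
    return examples

# For the target 'ate' (index 2):
#   CBOW      -> (['The', 'cat', 'the', 'mouse'], 'ate')  -- a single example
#   skip-gram -> ('ate', 'The'), ('ate', 'cat'), ('ate', 'the'), ('ate', 'mouse')
print(cbow_examples(sentence, window)[2])
print([p for p in skipgram_examples(sentence, window) if p[0] == "ate"])
```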