Edwin Chen (who works at Twitter, btw) has an example on his blog: 5 sentences, 2 topics:
- I like to eat broccoli and bananas.
- I ate a banana and spinach smoothie for breakfast.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Look at this cute hamster munching on a piece of broccoli.
Then he performs some "calculations":
- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
And guesses the topics:
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ...
- at this point you can interpret Topic A as being about food
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ...
- at this point you can interpret Topic B as being about cute animals
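If you want to see output of this shape for yourself, here is a minimal sketch using scikit-learn (my choice of library; Chen's post doesn't prescribe one). With only five sentences the fitted topics are noisy and vary by seed, but the printout mirrors the two kinds of guesses above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sentences = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

# Bag-of-words counts; English stop words ("to", "and", ...) are dropped.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(sentences)

# Ask for 2 topics; fit_transform returns each sentence's topic mixture.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

words = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]  # 4 highest-weight words per topic
    print(f"Topic {k}:", ", ".join(words[i] for i in top))
for i, mix in enumerate(doc_topics):
    print(f"Sentence {i + 1}: {mix.round(2)}")
```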
The question is, how did he come up with these numbers? Which words in these sentences carry "information"?
- broccoli, bananas, smoothie, breakfast, munching, eat
- chinchillas, kittens, cute, adopted, hamster
Now go through the sentences one by one, counting the words from each topic:
- food 3, cute 0 → food
- food 5, cute 0 → food
- food 0, cute 3 → cute
- food 0, cute 2 → cute
- food 2, cute 2 → 50% food + 50% cute
So my numbers come out a little different from Chen's. Maybe he counts the word "piece" in "piece of broccoli" as food.
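The tally above is mechanical enough to write down as code. A small sketch of the by-hand counting; note that to reproduce "food 5" for sentence 2, the food list also needs "spinach" and the inflected forms ("ate", "banana"), which is my assumption about how the counting was done:

```python
import re

# Word lists copied from the two bullets above, plus "spinach" and the
# inflected forms ("ate", "banana", "kitten") -- an assumption needed to
# reproduce the by-hand counts (e.g. "food 5" for sentence 2).
FOOD = {"eat", "ate", "broccoli", "banana", "bananas",
        "spinach", "smoothie", "breakfast", "munching"}
CUTE = {"chinchillas", "kitten", "kittens", "cute", "adopted", "hamster"}

sentences = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

for s in sentences:
    tokens = re.findall(r"[a-z]+", s.lower())
    food = sum(t in FOOD for t in tokens)
    cute = sum(t in CUTE for t in tokens)
    label = "food" if food > cute else "cute" if cute > food else "50% food + 50% cute"
    print(f"food {food}, cute {cute} -> {label}")
```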
We did two kinds of calculation in our heads:
- looking at the sentences and coming up with the 2 topics in the first place. LDA does this by treating each sentence as a "mixture" of topics and guessing the parameters of each topic.
- deciding which words are important. LDA uses "term frequency / inverse document frequency" (tf-idf) to capture this; see the sketch below.
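tf-idf itself is easy to demonstrate. A sketch with scikit-learn's TfidfVectorizer (again my choice of tool): words concentrated in one sentence, like "chinchillas", score higher than words spread across many, which is the sense in which they "contain information":

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(sentences)
words = tfidf.get_feature_names_out()

# The highest-scoring word in each sentence: rare-across-sentences words win.
for i, row in enumerate(matrix.toarray()):
    print(f"Sentence {i + 1}: {words[row.argmax()]} ({row.max():.2f})")
```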