A worked example of Latent Dirichlet Allocation (LDA)

I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory, and based on this blog post http://goo.gl/ccPvE I was able to develop some intuition for LDA. However, I still don't fully understand the various calculations that go into it. Could someone show me the calculations using a very small corpus (say, 3-5 sentences and 2-3 topics)?

1 answer

Edwin Chen (who works at Twitter, btw) has an example on his blog. 5 sentences, 2 topics:

  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster munching on a piece of broccoli.

Then he does some “calculations”:

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B

And guesses the topics (a runnable sketch follows this list):

  • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ...
    • at this point you could interpret Topic A as being about food
  • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ...
    • at this point you could interpret Topic B as being about cute animals
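
Chen's percentages are hand-made for illustration, but you can fit a real LDA model to these five sentences. Below is a minimal sketch using scikit-learn (my choice of library, not necessarily what Chen used); with this little data the fitted numbers depend on the random seed and won't match his exactly:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sentences = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

vec = CountVectorizer(stop_words="english")      # drops 'I', 'to', 'and', ...
X = vec.fit_transform(sentences)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                 # one topic mixture per sentence

# Per-topic word distributions: normalize the learned pseudo-counts.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

vocab = vec.get_feature_names_out()
for k, dist in enumerate(topic_word):
    top = dist.argsort()[::-1][:4]
    print(f"Topic {k}:", ", ".join(f"{vocab[i]} {dist[i]:.0%}" for i in top))
for mix, s in zip(doc_topic.round(2), sentences):
    print(mix, s)
```

On a corpus this small the two topics may come out swapped or smeared together; the percentages only become stable with more data.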

The question is, how did he come up with these numbers? Which words in these sentences carry "information":

  • broccoli, bananas, smoothie, breakfast, munching, eat
  • chinchilla, kitten, cute, adopted, hamster

Now go sentence by sentence, counting the words from each topic:

  • food 3, cute 0 → food
  • food 5, cute 0 → food
  • food 0, cute 3 → cute
  • food 0, cute 2 → cute
  • food 2, cute 2 → 50% food + 50% cute

So my numbers differ slightly from Chen's. Maybe he counts the word "piece" in "piece of broccoli" as a food word.
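
That hand count is easy to script. The two word lists below are my own guesses; note I added "ate" and "spinach" to the food list so the tallies above come out, which is exactly the kind of judgment call that makes my numbers differ from Chen's:

```python
# Hand-picked "information" words per topic; 'ate' and 'spinach' are my
# additions so the tallies match the list above.
food = {"eat", "ate", "broccoli", "bananas", "banana",
        "spinach", "smoothie", "breakfast", "munching"}
cute = {"chinchillas", "kittens", "kitten", "cute", "adopted", "hamster"}

sentences = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

for s in sentences:
    words = [w.strip(".").lower() for w in s.split()]
    f = sum(w in food for w in words)
    c = sum(w in cute for w in words)
    print(f"food {f}, cute {c} -> {f / (f + c):.0%} food, {c / (f + c):.0%} cute")
```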


We did two calculations in our heads (sketched in code after this list):

  • to look at the sentences and come up with 2 topics in the first place. LDA does this by treating each sentence as a “mixture” of topics and guessing the parameters of each topic.
  • to decide which words are important. LDA uses “term frequency / inverse document frequency” (tf-idf) to figure this out.
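
To make the first bullet concrete, here is a minimal collapsed Gibbs sampler, one standard way LDA “guesses the parameters” (not necessarily the inference method Chen used). The hand-stripped token lists, the hyperparameters alpha and beta, and the 200 sweeps are all assumptions chosen to keep the sketch small:

```python
import numpy as np

# Sentences reduced by hand to their content words (an assumption).
docs_tokens = [
    "eat broccoli bananas".split(),
    "ate banana spinach smoothie breakfast".split(),
    "chinchillas kittens cute".split(),
    "adopted kitten".split(),
    "cute hamster munching broccoli".split(),
]
vocab = sorted({w for d in docs_tokens for w in d})
w2i = {w: i for i, w in enumerate(vocab)}
docs = [[w2i[w] for w in d] for d in docs_tokens]

K, V = 2, len(vocab)
alpha, beta = 0.1, 0.01                    # Dirichlet hyperparameters
rng = np.random.default_rng(0)

# Count tables: document-topic, topic-word, and per-topic totals.
ndk = np.zeros((len(docs), K))
nkw = np.zeros((K, V))
nk = np.zeros(K)
z = []                                     # current topic of every token
for d, doc in enumerate(docs):
    zd = rng.integers(K, size=len(doc))
    z.append(zd)
    for w, t in zip(doc, zd):
        ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

for _ in range(200):                       # Gibbs sweeps
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t = z[d][n]                    # unassign this token ...
            ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
            # ... and resample: p(z=k) ∝ (ndk+alpha) * (nkw+beta) / (nk+V*beta)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            t = rng.choice(K, p=p / p.sum())
            z[d][n] = t
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)  # sentence-topic
phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)      # topic-word
print(np.round(theta, 2))                  # mixtures like "60% A / 40% B"
```

Each token's topic is resampled in proportion to how much its sentence already likes each topic times how much each topic already likes that word; after a few hundred sweeps the counts settle into mixtures like the 60%/40% split above.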
