A worked example of Latent Dirichlet Allocation

I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory, and based on this blog post I was able to develop an intuition for LDA. However, I still do not fully understand the various calculations that go into it. Could someone show me the calculations on a very small case (say 3-5 sentences and 2-3 topics)?

1 answer

Edwin Chen (who works at Twitter, btw) has an example on his blog. 5 sentences, 2 topics:

  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster chewing a piece of broccoli.

Then he performs some “calculations”:

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B

And guesses the topics:

  • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% chewing, ...
    • at this point you can interpret topic A as being about food
  • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ...
    • at this point you can interpret topic B as being about cute animals

The question is, how did he come up with these numbers? Which words in these sentences carry "information"?

  • broccoli, bananas, smoothie, breakfast, chewing, eat
  • chinchilla, kitten, cute, adopted, hamster

Now go sentence by sentence, counting the words from each topic:

  • food 3, cute 0 → food
  • food 5, cute 0 → food
  • food 0, cute 3 → cute
  • food 0, cute 2 → cute
  • food 2, cute 2 → 50% food + 50% cute
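The hand count above can be sketched in a few lines of Python. This is only an illustration of the counting step, not LDA itself. The keyword sets are assumptions: they extend the two word lists with inflected forms ("ate", "banana", "kittens") and with "spinach" so that the second sentence reaches a count of 5. The second sentence is written as "I ate a banana and spinach smoothie for breakfast."

```python
import re

SENTENCES = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster chewing a piece of broccoli.",
]

# Hand-picked keyword sets (an assumption of this sketch, see the note above).
FOOD = {"broccoli", "banana", "bananas", "spinach", "smoothie",
        "breakfast", "chewing", "eat", "ate"}
CUTE = {"chinchillas", "kittens", "kitten", "cute", "adopted", "hamster"}


def topic_counts(sentence):
    """Return (food_hits, cute_hits) for one sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    food = sum(w in FOOD for w in words)
    cute = sum(w in CUTE for w in words)
    return food, cute


def topic_mixture(sentence):
    """Normalize the two counts into proportions that sum to 1."""
    food, cute = topic_counts(sentence)
    total = food + cute
    if total == 0:
        return 0.0, 0.0  # no keyword found, so no evidence either way
    return food / total, cute / total


if __name__ == "__main__":
    for s in SENTENCES:
        print(topic_counts(s), topic_mixture(s))
```

The difference from real LDA is that here the keyword sets are given up front, whereas LDA has to infer both the word lists and the per-sentence proportions from the text alone.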

So my numbers are a little different from Chen's. Maybe he includes the word "piece" in "piece of broccoli", counting it as food.

We did two calculations in our heads:

  • looking at the sentences and coming up with 2 topics in the first place. LDA does this by treating each sentence as a “mixture” of topics and guessing the parameters of each topic.
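  • deciding which words are important. Standard LDA has no separate weighting step for this: it works from raw word counts, and a word matters for a topic when it tends to co-occur with that topic's other words. "Term-frequency / inverse-document-frequency" (TF-IDF) is a separate weighting scheme that is sometimes applied as preprocessing.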
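To make the "mixture of topics" idea concrete, here is a minimal sketch of one common way to fit LDA, collapsed Gibbs sampling, working directly from word counts. It is not necessarily the procedure behind the numbers quoted above. The function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, the iteration count, and the hand-normalized word lists in `DOCS` are all assumptions made for illustration.

```python
import random

# Content words of the five sentences, normalized by hand (an assumption).
DOCS = [
    ["eat", "broccoli", "banana"],
    ["eat", "banana", "spinach", "smoothie", "breakfast"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "adopted", "kitten"],
    ["cute", "hamster", "chewing", "broccoli"],
]


def gibbs_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Tiny collapsed Gibbs sampler; returns (theta, phi)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]       # n(d, k)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                     # n(k)
    assign = []

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # p(z = k | rest) is proportional to
                # (n(d,k) + alpha) * (n(k,w) + beta) / (n(k) + V * beta)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed estimates: theta = topic mix per sentence, phi = word mix per topic.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + V * beta) for w in vocab}
        for k in range(num_topics)
    ]
    return theta, phi


if __name__ == "__main__":
    theta, phi = gibbs_lda(DOCS)
    for row in theta:
        print([round(p, 2) for p in row])
```

Each row of `theta` plays the role of the "60% Topic A, 40% Topic B" figures, and each entry of `phi` plays the role of the "30% broccoli, 15% bananas, ..." lists. On a corpus this small the result depends heavily on the seed and hyperparameters, and the topic numbering is arbitrary, so do not expect the exact percentages quoted earlier.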

