Latent Dirichlet Allocation (LDA) is a topic model for finding the latent variables (topics) that underlie a collection of documents. I am using the Python gensim package and have two problems:
1. I printed the most frequent words for each topic (I tried 10, 20, and 50 topics) and found that the distribution over words is very "flat": even the most probable word in a topic has only about 1% probability.
2. Most topics are similar: the most probable words of each topic overlap heavily, so the topics share an almost identical set of high-probability words.
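To make the second problem measurable, here is a minimal sketch of how the overlap between topics could be quantified as the Jaccard similarity of their top-word sets. The function names and the toy topics are hypothetical and only for illustration. With gensim, the input could presumably be built from `model.show_topic(k, topn=n)`, which returns `(word, probability)` pairs for topic `k`.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pairwise_topic_overlap(topics):
    """Map each pair of topic ids to the Jaccard overlap of their top words.

    `topics` maps a topic id to a list of (word, probability) pairs,
    e.g. {k: model.show_topic(k, topn=20) for k in range(num_topics)}
    when using a trained gensim LdaModel (assumption, not run here).
    """
    result = {}
    for (i, words_i), (j, words_j) in combinations(sorted(topics.items()), 2):
        result[(i, j)] = jaccard(
            [w for w, _ in words_i],
            [w for w, _ in words_j],
        )
    return result


# Hypothetical toy topics standing in for real model output.
toy_topics = {
    0: [("game", 0.010), ("player", 0.009), ("online", 0.008), ("level", 0.007)],
    1: [("game", 0.011), ("player", 0.008), ("online", 0.008), ("server", 0.006)],
    2: [("game", 0.010), ("quest", 0.009), ("guild", 0.007), ("raid", 0.006)],
}

overlaps = pairwise_topic_overlap(toy_topics)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

A score close to 1 for most pairs would confirm that the topics are nearly identical. Scores near 0 would mean the top words are mostly distinct even though each distribution is flat.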
I suspect the problem is related to my documents: they all belong to one category, for example, they all describe various online games. In this case, will LDA still work? Since the documents themselves are very similar, perhaps a bag-of-words model is not the best approach to try?
Can someone give me some suggestions? Thanks!