gensim LDA: the distance between two different documents

EDIT: I found an interesting issue here. This link shows that gensim uses randomness in both training and inference, so it is suggested to set a fixed seed in order to get the same results every time. Why, though, do I get the same probability for every topic?
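For illustration, here is a minimal sketch of pinning the seed. The toy texts, num_topics, and seed value are mine, and the random_state argument assumes a gensim version that supports it; older versions only respond to seeding numpy's global RNG:

    import numpy as np
    from gensim import corpora
    from gensim.models import ldamodel

    # Toy corpus, purely for illustration
    texts = [['fashion', 'week'], ['fashion', 'style'], ['football', 'goal']]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    np.random.seed(42)  # older gensim versions: seed numpy's global RNG
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2,
                            random_state=42)  # newer versions: explicit seed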

What I want to do is find the topics for each Twitter user and compute the similarity between Twitter users based on the similarity of their topics. Is it possible to calculate the topics for each user in gensim, or do I have to calculate a dictionary of topics and cluster each user's topics?

In general, is this the best approach for comparing two Twitter users with topic models extracted in gensim? My code is as follows:

    from gensim import corpora, models
    from gensim.models import ldamodel

    def preprocess(id):
        # Returns the user's word list (built from the user's tweet file)
        user_list = user_corpus(id, 'user_' + str(id) + '.txt')  # helper defined elsewhere
        documents = []
        for line in open('user_' + str(id) + '.txt'):
            documents.append(line)
        # remove stop words
        lines = [line.rstrip() for line in open('stoplist.txt')]
        stoplist = set(lines)
        texts = [[word for word in document.lower().split() if word not in stoplist]
                 for document in documents]
        # remove words that appear fewer than 3 times
        all_tokens = sum(texts, [])
        tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) < 3)
        texts = [[word for word in text if word not in tokens_once] for text in texts]
        words = []
        for text in texts:
            for word in text:
                words.append(word)
        return words

    words1 = preprocess(14937173)
    words2 = preprocess(15386966)

    # Load the trained model and the trained dictionary
    lda = ldamodel.LdaModel.load('tmp/fashion1.lda')
    dictionary = corpora.Dictionary.load('tmp/fashion1.dict')

    corpus = [dictionary.doc2bow(words1)]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    corpus_lda = lda[corpus_tfidf]
    list1 = []
    for item in corpus_lda:
        list1.append(item)
    print(lda.show_topic(0))

    corpus2 = [dictionary.doc2bow(words2)]
    tfidf2 = models.TfidfModel(corpus2)
    corpus_tfidf2 = tfidf2[corpus2]
    corpus_lda2 = lda[corpus_tfidf2]
    list2 = []
    for it in corpus_lda2:
        list2.append(it)
    print(lda.show_topic(0))

The topic probabilities returned for the user corpus (when it is built from the user's full word list):

  [(0, 0.10000000000000002), (1, 0.10000000000000002), (2, 0.10000000000000002), (3, 0.10000000000000002), (4, 0.10000000000000002), (5, 0.10000000000000002), (6, 0.10000000000000002), (7, 0.10000000000000002), (8, 0.10000000000000002), (9, 0.10000000000000002)] 

When I use the list of the user's tweets instead, I get computed topics for each tweet.

Question 2: Does the following make sense: train the LDA model on many Twitter users together, and then calculate the topics for each user (over each user's corpus) with the previously trained LDA model?

In the example above, list1[0] returns a topic distribution where every probability equals 0.1. Basically, each line of text corresponds to a different tweet. If I compute the corpus with corpus = [dictionary.doc2bow(text) for text in texts], it gives me probabilities for each tweet separately. On the other hand, if I use corpus = [dictionary.doc2bow(words)], as in the example, the corpus is just all of the user's words as one document. In the second case, gensim returns the same probability for every topic, so for both users I get the same distribution.
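To make the two constructions concrete (toy tweets, purely illustrative):

    from gensim import corpora

    # Toy tokenized tweets for one user
    texts = [['fashion', 'week', 'paris'], ['new', 'fashion', 'style']]
    dictionary = corpora.Dictionary(texts)

    # One bag-of-words per tweet: yields a topic distribution per tweet
    per_tweet_corpus = [dictionary.doc2bow(text) for text in texts]

    # All of the user's words flattened into a single document:
    # yields one topic distribution for the whole user
    words = [word for text in texts for word in text]
    per_user_corpus = [dictionary.doc2bow(words)]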

Should the user's corpus be a single list of words, or a list of documents (a list of tweets)?

Regarding the TwitterRank approach of Qi He and Jianshu Weng: on page 264 they say that they aggregate the tweets published by each individual twitterer into a single large document, so that each document corresponds to one twitterer. I am confused: if a document contains all of a user's tweets, then what should the corpus contain?
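For completeness, here is a sketch of how I would compare two users once each has an inferred topic distribution. The Hellinger distance is one common choice for probability distributions; the helper names and num_topics=10 are mine:

    import numpy as np

    def to_dense(sparse_vec, num_topics):
        # Expand gensim's sparse [(topic_id, prob), ...] list to a dense vector
        dense = np.zeros(num_topics)
        for topic_id, prob in sparse_vec:
            dense[topic_id] = prob
        return dense

    def hellinger(p, q):
        # Hellinger distance between two discrete probability distributions
        return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

    # list1[0] and list2[0] are the (topic_id, prob) lists inferred in the code above
    distance = hellinger(to_dense(list1[0], 10), to_dense(list2[0], 10))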

2 answers

Fere Res, check the tutorial linked here. You have to compute the LDA model over all the users first, and then infer the topic vector of an unseen document, computed as

    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lda = lda[vec_bow]

If you then print vec_lda, you will get the unseen document's distribution over the topics of the LDA model.
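For context, a minimal end-to-end sketch of that suggestion, with toy data standing in for the real user tweets: train once on all users, then infer the topics of any unseen document with the trained model.

    from gensim import corpora
    from gensim.models import ldamodel

    # Toy corpus standing in for all users' tweets
    texts = [['fashion', 'week'], ['fashion', 'style'], ['football', 'goal']]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2)

    doc = "fashion week in paris"
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lda = lda[vec_bow]  # topic distribution of the unseen document
    print(vec_lda)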


According to the official documentation, Latent Dirichlet Allocation (LDA) is a transformation from bag-of-words counts into a topic space of lower dimensionality.

You can use LSI on top of TF-IDF, but not LDA. If you apply LDA on top of TF-IDF, it will generate nearly identical probabilities for each topic; you can print the output and check this yourself.
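A toy sketch of the contrast (illustrative data only): LDA consumes raw bag-of-words counts, while LSI is typically layered on TF-IDF.

    from gensim import corpora, models

    texts = [['fashion', 'week'], ['fashion', 'style'], ['football', 'goal']]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)  # counts in
    tfidf = models.TfidfModel(corpus)
    lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)  # TF-IDF in

    print(lda[corpus[0]])         # topic mixture from raw counts
    print(lda[tfidf[corpus[0]]])  # TF-IDF weights fed to LDA: near-uniform topics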

Also see https://radimrehurek.com/gensim/tut2.html .

