EDIT: I found a related problem here. This link shows that gensim uses randomness in both training and inference, and it suggests setting a fixed seed to get the same results on every run. Still, why do I get the same probability for every topic?
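As I understand it, gensim draws its randomness from NumPy's global RNG (newer versions also seem to expose a `random_state` argument on `LdaModel`, but I'm not sure my version has it). A minimal sketch of the re-seeding idea, using plain NumPy to show that re-seeding reproduces the same draws:

```python
import numpy

# older gensim versions use numpy's global RNG, so seeding numpy
# before each training run should make results reproducible
numpy.random.seed(42)
a = numpy.random.rand(3)

numpy.random.seed(42)
b = numpy.random.rand(3)

# identical draws after re-seeding with the same value
print((a == b).all())
```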
What I want to do is find the topics for each Twitter user and compute the similarity between two Twitter users based on the similarity of their topics. Is it possible to compute the same set of topics for every user in gensim, or do I need to build a separate dictionary and topic clusters for each user?
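For the similarity step, what I have in mind is something like cosine similarity between the two users' topic-probability vectors (a sketch with made-up distributions, not my real data; it assumes both lists cover the same topic ids in the same order):

```python
import numpy as np

def topic_similarity(dist1, dist2):
    """Cosine similarity between two (topic_id, probability) lists."""
    v1 = np.array([p for _, p in dist1])
    v2 = np.array([p for _, p in dist2])
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# made-up topic distributions for two users over 3 topics
user_a = [(0, 0.7), (1, 0.2), (2, 0.1)]
user_b = [(0, 0.6), (1, 0.3), (2, 0.1)]
print(topic_similarity(user_a, user_b))
```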
More generally, is extracting topic models with gensim the best way to compare two Twitter users? My code is as follows:
from gensim import corpora, models
from gensim.models import ldamodel

def preprocess(id):
    # Returns the user's word list (built from the user's tweets);
    # user_corpus is my own helper that writes the tweets to a file
    user_list = user_corpus(id, 'user_' + str(id) + '.txt')
    documents = []
    for line in open('user_' + str(id) + '.txt'):
        documents.append(line)
    # remove stop words
    lines = [line.rstrip() for line in open('stoplist.txt')]
    stoplist = set(lines)
    texts = [[word for word in document.lower().split() if word not in stoplist]
             for document in documents]
    # remove words that appear fewer than 3 times
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) < 3)
    texts = [[word for word in text if word not in tokens_once] for text in texts]
    words = []
    for text in texts:
        for word in text:
            words.append(word)
    return words

words1 = preprocess(14937173)
words2 = preprocess(15386966)

# Load the trained model and the trained dictionary
lda = ldamodel.LdaModel.load('tmp/fashion1.lda')
dictionary = corpora.Dictionary.load('tmp/fashion1.dict')

corpus = [dictionary.doc2bow(words1)]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
corpus_lda = lda[corpus_tfidf]
list1 = []
for item in corpus_lda:
    list1.append(item)
print(lda.show_topic(0))

corpus2 = [dictionary.doc2bow(words2)]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf2 = tfidf2[corpus2]
corpus_lda2 = lda[corpus_tfidf2]
list2 = []
for it in corpus_lda2:
    list2.append(it)
print(lda.show_topic(0))  # was corpus_lda.show_topic(0), which raises AttributeError
The topic probabilities returned for the user corpus (when the corpus is built from the full list of the user's words):
[(0, 0.10000000000000002), (1, 0.10000000000000002), (2, 0.10000000000000002), (3, 0.10000000000000002), (4, 0.10000000000000002), (5, 0.10000000000000002), (6, 0.10000000000000002), (7, 0.10000000000000002), (8, 0.10000000000000002), (9, 0.10000000000000002)]
When I instead use the list of the user's tweets (one document per tweet), I get topics computed for each individual tweet.
Question 2: Does the following make sense: train the LDA model on multiple Twitter users, then infer the topics for each user (using each user's corpus) with the previously trained LDA model?
In the above example, list1[0] returns a topic distribution with equal probabilities of 0.1. Basically, each line of text corresponds to a different tweet. If I build the corpus with corpus = [dictionary.doc2bow(text) for text in texts], gensim gives me the probabilities for each tweet separately. On the other hand, if I use corpus = [dictionary.doc2bow(words)], as in the example, the corpus is just all of the user's words as a single document. In that second case, gensim returns the same probability for every topic, so I get the same distribution for both users.
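To make sure I'm describing the two cases correctly, here is a toy illustration (pure Python, no gensim, with made-up tweets) of the two ways I build the corpus: one bag of words per tweet versus one bag for all of the user's words combined:

```python
from collections import Counter

# made-up tokenized tweets for one user
tweets = [['lda', 'topic', 'model'],
          ['twitter', 'user', 'topic']]

# Case 1: one document per tweet -> one bag of words per tweet
per_tweet = [Counter(t) for t in tweets]
print(len(per_tweet))        # 2 documents

# Case 2: all user words combined -> a single bag of words
all_words = [w for t in tweets for w in t]
combined = [Counter(all_words)]
print(len(combined))         # 1 document
print(combined[0]['topic'])  # 'topic' counted across all tweets: 2
```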
Should the user's corpus be a single list of words or a list of documents (i.e., a list of tweets)?
Regarding the TwitterRank approach of Qi He and Jianshu Weng (page 264), the paper says: "we combine the tweets published by an individual twitterer into a large document. Thus, each document corresponds to a twitterer." Well, I'm confused: if the document contains all of a user's tweets, then what should the corpus contain?
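If I understand the paper correctly, the aggregation step would look like this (hypothetical tweet lists; the resulting corpus then has exactly one document per user, which is what I think the model should be trained on):

```python
# hypothetical raw tweets for two users
user_tweets = {
    'user_a': ['first tweet', 'second tweet'],
    'user_b': ['another tweet'],
}

# TwitterRank-style: combine each user's tweets into one big document,
# so the corpus contains exactly one document per twitterer
corpus_docs = {user: ' '.join(tweets) for user, tweets in user_tweets.items()}
print(len(corpus_docs))       # one document per user
print(corpus_docs['user_a'])  # all of user_a's tweets joined together
```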