Which gensim corpora class should be used to load an LDA-transformed corpus? - Python

Question


How can I load a saved LDA-transformed corpus with Python's gensim? What I tried:

    from gensim import corpora, models
    import numpy.random
    numpy.random.seed(10)

    doc0 = [(0, 1), (1, 1)]
    doc1 = [(0,1)]
    doc2 = [(0, 1), (1, 1)]
    doc3 = [(0, 3), (1, 1)]

    corpus = [doc0,doc1,doc2,doc3]
    dictionary = corpora.Dictionary(corpus)

    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    corpus_tfidf.save('x.corpus_tfidf')

    # To access the tfidf fitted corpus i've saved i used corpora.MmCorpus.load()
    corpus_tfidf = corpora.MmCorpus.load('x.corpus_tfidf')

    lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
    corpus_lda = lda[corpus]
    corpus_lda.save('x.corpus_lda')

    for i,j in enumerate(corpus_lda):
        print j, corpus[i]

The above code outputs:

    [(0, 0.54259038344543631), (1, 0.45740961655456358)] [(0, 1), (1, 1)]
    [(0, 0.56718063124157458), (1, 0.43281936875842542)] [(0, 1)]
    [(0, 0.54255407573666647), (1, 0.45744592426333358)] [(0, 1), (1, 1)]
    [(0, 0.75229707773868093), (1, 0.2477029222613191)] [(0, 3), (1, 1)]

    # [(<topic_number_from x.corpus_lda model>,
    #   <probability of this topic for this document>),
    #  (<topic# from lda model>, <prob of this top for this doc>)] [<document[i] from corpus>]

If I want to load a saved LDA-transformed corpus, which gensim class should I use to load it?

I tried corpora.MmCorpus.load(), but it does not give me the same result as the transformed corpus shown above:

    >>> lda_corpus = corpora.MmCorpus.load('x.corpus_lda')
    >>> for i,j in enumerate(lda_corpus):
    ...     print j, corpus[i]
    ...
    [(0, 0.55087839240547309), (1, 0.44912160759452685)] [(0, 1), (1, 1)]
    [(0, 0.56715974584850259), (1, 0.43284025415149735)] [(0, 1)]
    [(0, 0.54275680271070581), (1, 0.45724319728929413)] [(0, 1), (1, 1)]
    [(0, 0.75233330695720912), (1, 0.24766669304279079)] [(0, 3), (1, 1)]


python nlp corpus gensim lda

alvas Mar 03 '13 at 10:20


2 answers

After trying all the available corpora.*Corpus classes ( http://radimrehurek.com/gensim/apiref.html ), I loaded the file with BleiCorpus, and it appears to give the same result as the saved corpus, only with fewer decimal digits.

    >>> from gensim import corpora, models
    >>> import numpy.random
    >>> numpy.random.seed(10)
    >>>
    >>> doc0 = [(0, 1), (1, 1)]
    >>> doc1 = [(0,1)]
    >>> doc2 = [(0, 1), (1, 1)]
    >>> doc3 = [(0, 3), (1, 1)]
    >>> corpus = [doc0,doc1,doc2,doc3]
    >>> dictionary = corpora.Dictionary(corpus)
    >>>
    >>> tfidf = models.TfidfModel(corpus)
    >>> corpus_tfidf = tfidf[corpus]
    >>>
    >>> lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=3)
    >>> corpus_lda = lda[corpus]
    >>> corpus_lda.save('x.corpus_lda')
    >>>
    >>> for i,j in enumerate(corpus_lda):
    ...     print j, corpus[i]
    ...
    [(0, 0.15441373560695118), (1, 0.56498524668290762), (2, 0.28060101771014123)] [(0, 1), (1, 1)]
    [(0, 0.59512220481946487), (1, 0.22817873367464175), (2, 0.17669906150589348)] [(0, 1)]
    [(0, 0.52219543266162705), (1, 0.15449347037173339), (2, 0.32331109696663957)] [(0, 1), (1, 1)]
    [(0, 0.83364632205849853), (1, 0.086514534997754619), (2, 0.079839142943746944)] [(0, 3), (1, 1)]
    >>>
    >>> lda_corpus = corpora.BleiCorpus.load('x.corpus_lda')
    >>> for i,j in enumerate(lda_corpus):
    ...     print j, corpus[i]
    ...
    [(0, 0.154413735607), (1, 0.564985246683), (2, 0.280601017710)] [(0, 1), (1, 1)]
    [(0, 0.595122204819), (1, 0.228178733675), (2, 0.176699061506)] [(0, 1)]
    [(0, 0.522195432662), (1, 0.154493470372), (2, 0.323311096967)] [(0, 1), (1, 1)]
    [(0, 0.833646322058), (1, 0.086514534998), (2, 0.079839142944)] [(0, 3), (1, 1)]


alvas Mar 03 '13 at 10:51


Radim · Accepted Answer · 2014-04-02T22:34:32+0000

There are several problems in your code.

To save a corpus in MatrixMarket format, use:

 corpora.MmCorpus.serialize('x.corpus_lda', corpus_lda)

The docs are here.

You train on corpus_tfidf, but then transform only lda[corpus] (no tf-idf). Use either tf-idf or plain bag-of-words, but use it consistently.
