LSI using gensim in Python

I use the Python gensim library for latent semantic indexing. I followed the tutorials on the website and everything works fine. Now I'm trying to modify it a bit: I want to re-run the LSI model every time I add a document.

Here is my code:

stoplist = set('for a of the and to in'.split())
num_factors = 3
corpus = []
for i in range(len(urls)):
    print "Importing", urls[i]
    doc = getwords(urls[i])
    cleandoc = [word for word in doc.lower().split() if word not in stoplist]
    if i == 0:
        dictionary = corpora.Dictionary([cleandoc])
    else:
        dictionary.addDocuments([cleandoc])
    newVec = dictionary.doc2bow(cleandoc)
    corpus.append(newVec)
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    lsi = models.LsiModel(corpus_tfidf, numTopics=num_factors, id2word=dictionary)
    corpus_lsi = lsi[corpus_tfidf]

getwords is a function I wrote that returns the contents of a website as a string. Again, this works if I wait until all the documents have been processed before running tf-idf and LSI, but that's not what I want: I want to do it on every iteration. Unfortunately, I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "streamlsa.py", line 51, in <module>
    lsi = models.LsiModel(corpus_tfidf, numTopics=num_factors, id2word=dictionary)
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 303, in __init__
    self.addDocuments(corpus)
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 365, in addDocuments
    self.printTopics(5) # TODO see if printDebug works and remove one of these..
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 441, in printTopics
    self.printTopic(i, topN = numWords)))
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 433, in printTopic
    return ' + '.join(['%.3f*"%s"' % (1.0 * c[val] / norm, self.id2word[val]) for val in most])
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/corpora/dictionary.py", line 52, in __getitem__
    return self.id2token[tokenid] # will throw for non-existent ids
KeyError: 1248

The error usually appears on the second document. I think I understand what it's telling me (the word indices are bad), I just can't understand WHY. I've tried many different things and nothing works. Does anyone know what's going on?

Thanks!

3 answers

This was a bug in gensim: the inverse id->word mapping is cached, but the cache was not updated after addDocuments().

It was fixed in 2011: https://github.com/piskvorky/gensim/commit/b88225cfda8570557d3c72b0820fefb48064a049
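The failure mode is easy to reproduce without gensim. Here is a minimal sketch (a hypothetical TinyDict class, not gensim's actual code) of a lazily built id->token cache that goes stale when new tokens are added, together with the fix of invalidating the cache on every update:

```python
class TinyDict(object):
    """Toy word <-> id mapping with a lazily built inverse cache."""

    def __init__(self):
        self.token2id = {}
        self.id2token = {}  # inverse cache, built on demand

    def add_documents(self, docs):
        for doc in docs:
            for token in doc:
                if token not in self.token2id:
                    self.token2id[token] = len(self.token2id)
        # The fix: invalidate the inverse cache after an update.
        # Without this line, lookups of new ids raise KeyError,
        # just like the traceback above.
        self.id2token = {}

    def __getitem__(self, tokenid):
        if not self.id2token:  # build the inverse cache on first use
            self.id2token = dict((i, t) for t, i in self.token2id.items())
        return self.id2token[tokenid]


d = TinyDict()
d.add_documents([["cats", "chase", "mice"]])
print(d[0])  # -> cats (this lookup builds the cache)
d.add_documents([["dogs", "chase", "cats"]])
print(d[3])  # -> dogs (works only because the cache was invalidated)
```

If you delete the `self.id2token = {}` line in add_documents, the second lookup fails with `KeyError: 3`, which is exactly the shape of the bug the commit fixed.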


OK, so I found a solution, though not an optimal one.

If you create a dictionary with corpora.Dictionary and then immediately add documents with dictionary.addDocuments, everything works fine.

But if you use the dictionary in between (by calling dictionary.doc2bow, or by passing the dictionary to the LSI model via id2word), your dictionary gets "frozen" and cannot be updated. You can still call dictionary.addDocuments, and it will report that it was updated, and even tell you how big the new dictionary is, e.g.:

 INFO:dictionary:built Dictionary(6627 unique tokens) from 8 documents (total 24054 corpus positions) 

But when you reference any of the new indices, you get an error. I'm not sure whether this is a bug or intended behavior (for whatever reason), but the fact that gensim reports the documents were successfully added to the dictionary is certainly a bug.

At first I tried putting the dictionary calls into separate functions, so that only a local copy of the dictionary would be modified. Well, it still breaks. That seemed strange to me, and I couldn't work out why.

My next step was to pass a copy of the dictionary using copy.copy. This works, but obviously adds some overhead. However, it lets you keep a working copy of your corpus and dictionary. The biggest drawback for me, though, was that this solution doesn't let me remove words that appear only once in the corpus with filterTokens, because that would require changing the dictionary.
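One likely explanation for why the "separate function" attempt failed: Python passes object references, so a function parameter is the very same dictionary object, not a local copy, while copy.copy creates a genuinely distinct object. A minimal stdlib sketch of the difference (toy functions, plain dicts standing in for the gensim Dictionary):

```python
import copy

def freeze_in_place(d):
    # Mutating the parameter mutates the caller's object --
    # this is why moving the call into a function does not help.
    d["frozen"] = True

def freeze_a_copy(d):
    d2 = copy.copy(d)  # shallow copy: a new top-level object
    d2["frozen"] = True
    return d2

master = {"token2id": {"cats": 0, "chase": 1, "mice": 2}}
frozen = freeze_a_copy(master)
print("frozen" in master)  # -> False (master untouched)
freeze_in_place(master)
print("frozen" in master)  # -> True (caller's dict was mutated)
```

Note that copy.copy is shallow: nested structures (like the token2id mapping here) are still shared between the original and the copy, so copy.deepcopy may be needed if whatever consumes the copy mutates the dictionary's internals.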

The other solution is simply to rebuild everything (corpus, dictionary, LSI and tf-idf models) on every iteration. With my small sample dataset this actually gives slightly better results, but it won't scale to very large datasets without running into memory problems. Still, for the moment that's what I'm doing.

If any experienced gensim users have a better (and more memory-friendly) solution, so that I won't run into problems with larger datasets, please let me know!


In doc2bow you can set allow_update=True and it will automatically update your dictionary on every doc2bow call.

http://radimrehurek.com/gensim/corpora/dictionary.html
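For readers without gensim at hand, here is a minimal sketch of the mechanism (a toy doc2bow function illustrating the allow_update idea, not gensim's actual implementation):

```python
from collections import Counter

def doc2bow(document, token2id, allow_update=False):
    """Toy bag-of-words conversion with optional in-place
    vocabulary growth, mimicking gensim's allow_update flag."""
    bow = []
    for token, freq in sorted(Counter(document).items()):
        if token not in token2id:
            if not allow_update:
                continue  # unknown words are silently dropped
            token2id[token] = len(token2id)  # grow the vocabulary in place
        bow.append((token2id[token], freq))
    return sorted(bow)

vocab = {}
print(doc2bow(["cats", "chase", "mice", "cats"], vocab, allow_update=True))
# -> [(0, 2), (1, 1), (2, 1)]  (vocab grew to three entries)
print(doc2bow(["dogs", "cats"], vocab))
# -> [(0, 1)]  ("dogs" is dropped because allow_update is off)
```

With updates enabled on every call, the dictionary never "freezes", which is exactly what the streaming loop in the question needs.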

