OK, so I found a solution, although not an optimal one.
If you create a dictionary with corpora.Dictionary and then immediately add documents with dictionary.addDocuments, everything works fine.
But if you use the dictionary between these two calls (by calling dictionary.doc2bow, or by passing the dictionary to the LSI model via id2word), your dictionary becomes "frozen" and can no longer be updated. You can still call dictionary.addDocuments, and it will report that the update succeeded, even telling you how big the new dictionary is, for example:
INFO:dictionary:built Dictionary(6627 unique tokens) from 8 documents (total 24054 corpus positions)
But when you reference any of the new token ids, you get an error. I'm not sure whether this is a bug or intended behavior (for whatever reason), but at the very least the fact that gensim reports the documents as successfully added to the dictionary is definitely a bug.
At first I tried wrapping the dictionary calls in separate functions, so that only a local copy of the dictionary would be changed. Well, it still breaks. This is strange to me, and I have no idea why.
My next step was to pass a copy of the dictionary using copy.copy. This works, though with obvious extra overhead, and it lets you keep a working copy of your corpus and dictionary. The biggest drawback, however, is that this solution does not let me remove words that appear only once in the corpus using filterTokens, because that would require changing the dictionary.
Another solution is to simply rebuild everything (corpus, dictionary, LSI and TF-IDF models) at each iteration. With my small sample dataset this actually gives slightly better results, but it will not scale to very large datasets without running into memory problems. For the moment, however, this is what I am doing.
If any experienced gensim users have a better (and more memory-friendly) solution that won't run into problems on larger datasets, please let me know!