Django Haystack / ElasticSearch indexing process aborted

I have a setup with Django 1.4, Haystack 2 beta and ElasticSearch 0.20. My database is PostgreSQL 9.1 and it contains several million records. When I try to index all of my data with Haystack/ElasticSearch, the process is aborted and I get a message that just says "Killed". So far I have noticed the following:

  • I do get a count of documents to index, so it is not failing with something like "0 documents to index".
  • Indexing a small set, say 1000 records, works fine.
  • I tried hard-coding the timeout in haystack/backends/__init__.py, and it seems to have no effect.
  • I tried changing settings in elasticsearch.yml, also to no avail.

If hard-coding the timeout does not work, how else can I extend the time allowed for indexing? Is there another way to change this directly in ElasticSearch? Or perhaps some kind of batch-processing approach?

Thanks in advance!

+4
3 answers

It turned out to be a bug in this version of Haystack. The line of code causing the problem is in haystack/management/commands/update_index.py:

    pks_seen = set([smart_str(pk) for pk in qs.values_list('pk', flat=True)])

It causes the server to run out of memory. However, it does not appear to be required for the indexing itself, so I simply changed it to:

    pks_seen = set([])

Now it works through the batches without a problem. Thanks to everyone who responded!
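
If you would rather not empty pks_seen entirely, a possible middle ground is to stream the primary keys instead of materialising the whole list up front. This is only an untested sketch against the same file, and with several million rows the resulting set of string keys can still be large, so it may only soften the problem:

    # Untested: .iterator() stops Django from caching the full result set,
    # so only the set of stringified pks stays in memory.
    pks_seen = set(smart_str(pk) for pk in qs.values_list('pk', flat=True).iterator())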

+2

My guess is that the problem lies in building the documents to send to ElasticSearch, and that the batch-size option will help you.

The update method in the ElasticSearch backend prepares the documents to index from the queryset it is given, and then performs a single bulk insert for that queryset:

    self.conn.bulk_index(self.index_name, 'modelresult', prepped_docs, id_field=ID)

So if you have a table with millions of records, running update_index on that indexed model means generating those millions of documents and then indexing them, and my guess is that this is where it falls over. Setting a batch limit with the --batch-size option should cap how many documents are generated at once by slicing the queryset into chunks of that size.
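
For example, a usage sketch (adjust the number to whatever fits comfortably in your memory):

    # index in chunks of 1000 documents per bulk request
    python manage.py update_index --batch-size=1000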

+6

Have you watched how much memory the process consumes while you try to index all of those records? When you see "Killed", it usually means the system ran out of memory and the OOM killer decided to kill your process to free up resources.
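
If you want to confirm that, watch the process with top or free while it runs, or check the kernel log afterwards; on most Linux systems an OOM kill leaves a trace there (the exact message wording varies by kernel):

    # look for the OOM killer's report of the killed process
    dmesg | grep -i "out of memory"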

+1
