Why does iterating over a large Django QuerySet consume massive amounts of memory?

The table in question contains approximately ten million rows.

    for event in Event.objects.all():
        print event

Memory usage climbs steadily to around 4 GB, after which the rows print out quickly. The long delay before the first line printed surprised me - I expected it to print almost instantly.

I also tried Event.objects.iterator() , which behaved the same way.

I don't understand what Django is loading into memory or why it does this. I expected Django to iterate over the results at the database level, which would mean the results print at a roughly constant rate rather than all at once after a long wait.

What did I misunderstand?

(I don't know whether it matters, but I'm using PostgreSQL.)

+103
sql django postgresql django-orm
Nov 19 '10 at 4:49
9 answers

Nate C was close, but not quite.

From the docs:

You can evaluate a QuerySet in the following ways:

  • Iteration

    A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:

        for e in Entry.objects.all():
            print e.headline

So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.

From my reading of the docs, iterator() does nothing more than bypass the QuerySet's internal caching mechanisms. I think it might make sense for it to fetch things one by one, but that would conversely require ten million individual hits on your database. Maybe not all that desirable.

Iterating over large datasets efficiently is something that we still haven't gotten quite right, but there are some snippets that may be useful for your purposes:
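The snippet links themselves aren't preserved in this copy. As a rough, hedged sketch of the pattern such snippets usually implement (not the original code): walk the table in primary-key order one chunk at a time, so only chunk_size objects are in memory at once. The helper name and chunk size are placeholders, and it assumes an auto-incrementing integer primary key.

    import gc

    def queryset_iterator(queryset, chunk_size=1000):
        # assumes an auto-incrementing integer primary key
        if not queryset.exists():
            return
        last_pk = queryset.order_by('-pk')[0].pk   # highest pk we need to reach
        queryset = queryset.order_by('pk')
        pk = 0
        while pk < last_pk:
            for row in queryset.filter(pk__gt=pk)[:chunk_size]:
                pk = row.pk
                yield row
            gc.collect()                           # drop the chunk we just consumed

    for event in queryset_iterator(Event.objects.all()):
        print(event)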

+103
Nov 19 '10 at 5:44

It may not be faster or more efficient, but as a ready-made solution, why not use Django's Paginator and Page objects, described here:

https://docs.djangoproject.com/en/dev/topics/pagination/

Something like this:

    from django.core.paginator import Paginator
    from djangoapp.models import model

    paginator = Paginator(model.objects.all(), 1000)  # chunks of 1000, change to the desired chunk size

    for page in range(1, paginator.num_pages + 1):
        for row in paginator.page(page).object_list:
            # here you can do whatever you want with the row
            pass
        print "done processing page %s" % page
+38
Nov 29 '13 at 12:19

Django's default behavior is to cache the whole QuerySet result when it evaluates the query. You can use the QuerySet's iterator() method to avoid this caching:

    for event in Event.objects.all().iterator():
        print event

https://docs.djangoproject.com/en/dev/ref/models/querysets/#iterator

The iterator() method evaluates the queryset and then reads the results directly, without doing caching at the QuerySet level. This gives better performance and a significant reduction in memory when iterating over a large number of objects that you only need to access once. Note that caching is still done at the database level.

Using iterator() reduces memory usage for me, but it's still higher than I expected. The paginator approach suggested by mpaf uses much less memory, but is 2-3 times slower for my test case.

    from django.core.paginator import Paginator

    def chunked_iterator(queryset, chunk_size=10000):
        paginator = Paginator(queryset, chunk_size)
        for page in range(1, paginator.num_pages + 1):
            for obj in paginator.page(page).object_list:
                yield obj

    for event in chunked_iterator(Event.objects.all()):
        print event
+25
Jul 20 '15 at 20:18

This is from the docs: http://docs.djangoproject.com/en/dev/ref/models/querysets/

No database activity actually occurs until you do something to evaluate the queryset.

So when print event executes, the query fires (which is a full table scan, given your command) and loads the results. You are asking for all the objects, and there is no way to get the first object without getting all of them.

But if you do something like:

 Event.objects.all()[300:900] 

http://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets

Then Django will internally add OFFSET and LIMIT to the SQL.
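As a hedged illustration of that point (not part of the original answer), slicing can be used to walk the whole table in fixed-size LIMIT/OFFSET chunks, so only one chunk of objects is in memory at a time; the chunk size of 1000 is arbitrary:

    CHUNK = 1000
    offset = 0
    while True:
        # each slice turns into LIMIT/OFFSET in the SQL, so only CHUNK rows are loaded at once
        batch = list(Event.objects.all()[offset:offset + CHUNK])
        if not batch:
            break
        for event in batch:
            print(event)
        offset += CHUNK

Keep in mind that OFFSET gets progressively slower as it grows on large tables, which is exactly the drawback the next answer describes.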

+8
Nov 19 '10 at 5:17

For large amounts of records, a database cursor performs even better. You do need raw SQL in Django; the Django "cursor" is something different from a SQL cursor.

The LIMIT/OFFSET method suggested by Nate C might be good enough for your situation. For large amounts of data it is slower than a cursor, because it has to run the same query over and over again and skip more and more results each time.
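The answer doesn't include code, so here is a hedged sketch of what a server-side (streaming) cursor looks like with psycopg2 against PostgreSQL; the connection string and table name are placeholders, not taken from the original post:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection settings
    cur = conn.cursor(name='event_stream')              # a named cursor is server-side
    cur.itersize = 2000                                 # rows fetched per round trip
    cur.execute("SELECT * FROM myapp_event")            # table name is a guess

    for row in cur:                                     # rows stream in; memory stays flat
        print(row)

    cur.close()
    conn.close()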

+7
Nov 19 '10 at 7:57

Django doesn't have a good solution for fetching large numbers of items from the database.

    import gc

    # Get the events in reverse order
    eids = Event.objects.order_by("-id").values_list("id", flat=True)

    for index, eid in enumerate(eids):
        event = Event.objects.get(id=eid)
        # do necessary work with event
        if index % 100 == 0:
            gc.collect()
            print("completed 100 items")

values_list can be used to fetch all the ids from the database, and then each object is fetched separately. Over time, large objects accumulate in memory and are not garbage-collected until the loop exits, so the code above collects garbage manually after every 100th item.

+7
Jun 30 '14 at 14:10

This happens because objects for the whole queryset get loaded into memory all at once. You need to chunk up your queryset into smaller digestible bits. The pattern for doing this is called spoonfeeding. Here is a brief implementation.

    def spoonfeed(qs, func, chunk=1000, start=0):
        '''
        Chunk up a large queryset and run func on each item.

        Works with automatic primary key fields.

        chunk -- how many objects to take on at once
        start -- PK to start from

        >>> spoonfeed(Spam.objects.all(), nom_nom)
        '''
        while start < qs.order_by('pk').last().pk:
            for o in qs.filter(pk__gt=start, pk__lte=start + chunk):
                func(o)
            start += chunk

To use this, you write a function that performs operations on your object:

    def set_population_density(town):
        town.population_density = calculate_population_density(...)
        town.save()

and then run that function over your queryset:

 spoonfeed(Town.objects.all(), set_population_density) 

This can be further improved by using multiprocessing to execute func for multiple objects in parallel.
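As a hedged sketch of that multiprocessing idea (not part of the original answer): map over primary keys, which are cheap to pickle, and let each worker fetch its own objects over its own database connection. The helper names and pool size are placeholders, and connections.close_all() assumes a reasonably recent Django version.

    from multiprocessing import Pool

    from django import db

    def process_town(pk):
        # each worker re-fetches its object over its own DB connection
        town = Town.objects.get(pk=pk)
        set_population_density(town)

    def spoonfeed_parallel(qs, worker=process_town, processes=4):
        pks = list(qs.values_list('pk', flat=True))
        db.connections.close_all()  # forked children must not reuse the parent's connection
        pool = Pool(processes)
        pool.map(worker, pks)
        pool.close()
        pool.join()

    spoonfeed_parallel(Town.objects.all())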

+5
Apr 04 '15 at 20:22

Here's a solution including len and count:

    class GeneratorWithLen(object):
        """
        Generator that includes len and count for a given queryset.
        """
        def __init__(self, generator, length):
            self.generator = generator
            self.length = length

        def __len__(self):
            return self.length

        def __iter__(self):
            return self.generator

        def __getitem__(self, item):
            return self.generator.__getitem__(item)

        def next(self):
            return next(self.generator)

        def count(self):
            return self.__len__()


    def batch(queryset, batch_size=1024):
        """
        Returns a generator that does not cache results on the QuerySet.

        Aimed at HUGE/ENORMOUS data sets: no caching, no more memory used than batch_size.

        :param batch_size: size of the maximum chunk of data kept in memory
        :return: generator
        """
        total = queryset.count()

        def batch_qs(_qs, _batch_size=batch_size):
            """
            Returns a (start, end, total, queryset) tuple for each batch in the
            given queryset.
            """
            for start in range(0, total, _batch_size):
                end = min(start + _batch_size, total)
                yield (start, end, total, _qs[start:end])

        def generate_items():
            qs = queryset.order_by()  # clear ordering; relies on autoincremental PK order
            for start, end, total_count, chunk in batch_qs(qs):
                for item in chunk:
                    yield item

        return GeneratorWithLen(generate_items(), total)

Usage:

    events = batch(Event.objects.all())
    len(events) == events.count()
    for event in events:
        # Do something with the Event
        pass
+3
Oct 29 '15 at 8:07

I usually use a raw MySQL query instead of the Django ORM for this kind of task.

MySQL supports streaming mode, so we can safely and quickly iterate over all records without running out of memory.

    import MySQLdb
    import MySQLdb.cursors

    db_config = {}  # configure your db here
    connection = MySQLdb.connect(
        host=db_config['HOST'], user=db_config['USER'],
        port=int(db_config['PORT']), passwd=db_config['PASSWORD'],
        db=db_config['NAME'])

    cursor = MySQLdb.cursors.SSCursor(connection)  # SSCursor for streaming mode
    cursor.execute("SELECT * FROM event")
    while True:
        record = cursor.fetchone()
        if record is None:
            break
        # Do something with record here

    cursor.close()
    connection.close()

Ref:

0
Jun 30 '17 at 4:32


