Best way to process a database in chunks with a Django QuerySet?

I run a batch operation over every row in the database. This involves selecting each model instance and doing something with it. It makes sense to split the work into chunks and process it chunk by chunk.

I am currently using Paginator because it is convenient. This means I need an ordering on the values so that they can be paged through in order. This generates SQL statements that have ORDER BY and LIMIT clauses, and for each chunk I think Postgres may be sorting the whole table (although I can't claim any knowledge of the internals). All I know is that the database is at around 50% CPU, which I think is too high just for SELECTs.

What is the most RDBMS- and CPU-friendly way to iterate over an entire table?

Assume the contents of the database do not change during the batch operation.

+5
2 answers

From your description, you don't actually care about the sort order of the processed rows. If your tables have primary keys (which I would expect!), this crude chunking method will be much faster:

SELECT * FROM tbl WHERE id BETWEEN 0    AND 1000;
SELECT * FROM tbl WHERE id BETWEEN 1001 AND 2000;
...

This performs the same regardless of the offset, and (almost) the same regardless of table size. Retrieve the minimum and maximum of your primary key and partition accordingly:

SELECT min(id), max(id) from tbl; -- then divide in suitable chunks
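The partitioning arithmetic is simple enough to sketch in plain Python (the function name here is mine, not from the answer):

```python
def chunk_ranges(min_id, max_id, chunk_size):
    """Yield inclusive (low, high) id ranges covering [min_id, max_id]."""
    for low in range(min_id, max_id + 1, chunk_size):
        yield (low, min(low + chunk_size - 1, max_id))

# e.g. chunk_ranges(1, 2500, 1000) -> (1, 1000), (1001, 2000), (2001, 2500)
```

Each pair maps directly onto one `WHERE id BETWEEN low AND high` query. Gaps in the id sequence are harmless; some chunks just come back smaller.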

As opposed to:

SELECT * FROM tbl ORDER BY id LIMIT 1000;
SELECT * FROM tbl ORDER BY id LIMIT 1000 OFFSET 1000;
...

The latter performs worse the further you page, because for each chunk the database has to sort and skip all rows before the requested offset.
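To make the contrast concrete, here is a runnable sketch using SQLite from the Python standard library (the same point applies to Postgres); the table and ids are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO tbl (id, val) VALUES (?, ?)",
                 [(i, "row%d" % i) for i in range(1, 26)])

# Range-based chunk: a direct index scan on the primary key.
range_chunk = conn.execute(
    "SELECT id FROM tbl WHERE id BETWEEN 11 AND 20").fetchall()

# Offset-based chunk: the database must order, then skip the first 10 rows.
offset_chunk = conn.execute(
    "SELECT id FROM tbl ORDER BY id LIMIT 10 OFFSET 10").fetchall()

assert range_chunk == offset_chunk  # same rows, different cost profile
```

Both queries return the same rows here, but the offset variant's skipped-row work grows with every successive chunk, while the range variant's cost stays flat.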

+5

Here is the range-based (BETWEEN) chunking expressed as a Django QuerySet helper:

This keeps everything as a Django QuerySet, so no raw SQL is needed. It filters on the "id" field with a range lookup, which Django translates to BETWEEN.

from django.db import models

def chunked_queryset(qs, batch_size, index='id'):
    """
    Yield the queryset split into batches of at most `batch_size` rows.
    Any ordering on the queryset is discarded.
    """
    qs = qs.order_by()  # clear ordering
    min_max = qs.aggregate(min=models.Min(index), max=models.Max(index))
    min_id, max_id = min_max['min'], min_max['max']
    if min_id is None:  # empty queryset
        return
    for i in range(min_id, max_id + 1, batch_size):
        filter_args = {'{0}__range'.format(index): (i, i + batch_size - 1)}
        yield qs.filter(**filter_args)

Usage:

for chunk in chunked_queryset(SomeModel.objects.all(), 20):
    # `chunk` is a queryset
    for item in chunk:
        # `item` is a SomeModel instance
        pass
+2
