Best way to process a database in chunks with a Django QuerySet?

I run a batch operation over every row in the database. This involves selecting each model instance and doing something with it. It makes sense to split the work into chunks and process it chunk by chunk.

I am currently using Paginator because it is convenient. This means I need an ordering on the values so that they can be paged through in order. This generates SQL statements that have ORDER BY and LIMIT clauses, and for each chunk I think Postgres may be sorting the whole table (although I can't claim any knowledge of the internals). All I know is that the database is at around 50% CPU, which I think is too high just for SELECTs.

What is the most RDBMS- and CPU-friendly way to iterate over an entire table?

Assume the contents of the database do not change during the batch operation.

+5
2 answers

From your description, you don't actually care about the sort order of the processed rows. If your tables have primary keys (which I would expect!), this crude chunking method will be much faster:

SELECT * FROM tbl WHERE id BETWEEN 0    AND 1000;
SELECT * FROM tbl WHERE id BETWEEN 1001 AND 2000;
...

This performs the same regardless of the offset, and (almost) the same regardless of table size. Retrieve the minimum and maximum of your primary key and partition accordingly:

SELECT min(id), max(id) from tbl; -- then divide in suitable chunks
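The partitioning arithmetic is simple enough to sketch in plain Python (the function name here is mine, not from the answer):

```python
def chunk_ranges(min_id, max_id, chunk_size):
    """Yield inclusive (low, high) id ranges covering [min_id, max_id]."""
    for low in range(min_id, max_id + 1, chunk_size):
        yield (low, min(low + chunk_size - 1, max_id))

# e.g. chunk_ranges(1, 2500, 1000) -> (1, 1000), (1001, 2000), (2001, 2500)
```

Each pair maps directly onto one `WHERE id BETWEEN low AND high` query. Gaps in the id sequence are harmless; some chunks just come back smaller.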

As opposed to:

SELECT * FROM tbl ORDER BY id LIMIT 1000;
SELECT * FROM tbl ORDER BY id LIMIT 1000 OFFSET 1000;
...

The latter performs worse the further you page, because for each chunk the database has to sort and skip all rows before the requested offset.
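To make the contrast concrete, here is a runnable sketch using SQLite from the Python standard library (the same point applies to Postgres); the table and ids are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO tbl (id, val) VALUES (?, ?)",
                 [(i, "row%d" % i) for i in range(1, 26)])

# Range-based chunk: a direct index scan on the primary key.
range_chunk = conn.execute(
    "SELECT id FROM tbl WHERE id BETWEEN 11 AND 20").fetchall()

# Offset-based chunk: the database must order, then skip the first 10 rows.
offset_chunk = conn.execute(
    "SELECT id FROM tbl ORDER BY id LIMIT 10 OFFSET 10").fetchall()

assert range_chunk == offset_chunk  # same rows, different cost profile
```

Both queries return the same rows here, but the offset variant's skipped-row work grows with every successive chunk, while the range variant's cost stays flat.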

+5

Here is the range-based (BETWEEN) chunking expressed as a Django QuerySet helper:

This keeps everything as a Django QuerySet, so no raw SQL is needed. It filters on the "id" field with a range lookup, which Django translates to BETWEEN.

from django.db import models

def chunked_queryset(qs, batch_size, index='id'):
    """
    Yield the queryset split into batches of at most `batch_size` rows.
    Any ordering on the queryset is discarded.
    """
    qs = qs.order_by()  # clear ordering
    min_max = qs.aggregate(min=models.Min(index), max=models.Max(index))
    min_id, max_id = min_max['min'], min_max['max']
    if min_id is None:  # empty queryset
        return
    for i in range(min_id, max_id + 1, batch_size):
        filter_args = {'{0}__range'.format(index): (i, i + batch_size - 1)}
        yield qs.filter(**filter_args)

Usage:

for chunk in chunked_queryset(SomeModel.objects.all(), 20):
    # `chunk` is a queryset
    for item in chunk:
        # `item` is a SomeModel instance
        pass
+2
