Speed up bulk insertion with the Django ORM?

I plan to load a billion records from ~750 files (each ~250 MB) into a database using the Django ORM. Currently it takes ~20 minutes to process each file, and I was wondering if there is any way to speed up this process.

I took the following measures:

  • Used @transaction.commit_manually and commit once every 5,000 records (a rough sketch of this setup appears after this list)
  • Set DEBUG = False so that Django does not accumulate all the SQL commands in memory
  • The loop that processes the records from a single file is contained entirely in one function (to minimize stack changes)
  • Refrained from hitting the database for queries (used a local cache of objects already in the database instead of get_or_create )
  • Set force_insert=True in save() in the hope of saving Django some logic
  • Explicitly set the id in the hope of saving Django some more logic
  • General code minimization and optimization
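For reference, here is a stripped-down sketch of what my per-file loop looks like with the old (pre-Django 1.6) commit_manually API mentioned above; the model, field, and helper names are purely illustrative, not the real code:

    from django.db import transaction

    @transaction.commit_manually
    def load_file(path, seen):
        """seen: local cache mapping keys to ids, kept to avoid get_or_create round-trips."""
        for i, line in enumerate(open(path), start=1):
            key, value = line.rstrip('\n').split('\t')
            if key in seen:
                continue
            row = MyRecord(id=allocate_id(), key=key, value=value)  # MyRecord / allocate_id are hypothetical
            row.save(force_insert=True)   # skip the SELECT that decides between INSERT and UPDATE
            seen[key] = row.id
            if i % 5000 == 0:
                transaction.commit()      # commit once every 5000 records
        transaction.commit()              # flush whatever is left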

What else can I do to speed up the process? Here are some of my thoughts:

  • Use some faster Python compiler or runtime (Psyco?)
  • Bypass the ORM and use SQL directly
  • Use some third-party code that might be better (1, 2)
  • Ask the Django community to add a bulk_insert function

Any pointers regarding these items or any other ideas would be welcome :)

+43
optimization django orm bulkinsert
Nov 27 '10
7 answers

This does not apply specifically to the Django ORM, but recently I had to bulk insert >60 million rows of 8 columns of data from over 2000 files into a sqlite3 database. And I learned that the following three things reduced insert time from 48 hours to ~1 hour:

  • increase your database's cache size so it uses more RAM (the defaults are always very small; I used 3 GB); in SQLite this is done with PRAGMA cache_size = n_of_pages;

  • write the journal to RAM instead of disk (this causes a minor problem if the system crashes, but something I consider negligible given that you already have the source data on disk); in SQLite this is done with PRAGMA journal_mode = MEMORY

  • last and perhaps most important: do not create an index while inserting. That also means not declaring UNIQUE or other constraints that cause the DB to build an index. Create the indexes only after you finish inserting.

As mentioned in another answer, you should also use cursor.executemany() (or the conn.executemany() shortcut). To use it, do:

 cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data) 

Here iterable_data can be a list or something similar, or even an open file reader.
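Put together, a minimal sketch of this recipe with the standard sqlite3 module might look like the following (the table layout, file format, and list_of_files are assumptions for illustration):

    import sqlite3

    conn = sqlite3.connect('data.db')
    conn.execute('PRAGMA cache_size = 3000000')   # in pages; pick a value that amounts to a few GB of RAM
    conn.execute('PRAGMA journal_mode = MEMORY')  # keep the rollback journal in RAM instead of on disk
    conn.execute('CREATE TABLE IF NOT EXISTS mytable (field1, field2, field3)')  # no indexes or UNIQUE yet

    def rows(path):
        """Yield one (field1, field2, field3) tuple per line of a tab-separated file."""
        with open(path) as f:
            for line in f:
                yield tuple(line.rstrip('\n').split('\t'))

    for path in list_of_files:
        conn.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', rows(path))
        conn.commit()

    # Build the index only after all inserts are done.
    conn.execute('CREATE INDEX idx_mytable_field1 ON mytable (field1)')
    conn.commit()
    conn.close()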

+16
Sep 25 '12 at 20:23

Drop down to the DB-API and use cursor.executemany() . See PEP 249 for details.
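In a Django project you can reach the DB-API cursor through django.db.connection; a minimal sketch, with a made-up table and column names:

    from django.db import connection

    def raw_bulk_insert(rows):
        """rows: iterable of (field1, field2) tuples; myapp_mytable is a hypothetical table."""
        cursor = connection.cursor()
        # Django's cursor uses %s parameter placeholders on every backend, including SQLite.
        cursor.executemany(
            'INSERT INTO myapp_mytable (field1, field2) VALUES (%s, %s)',
            rows,
        )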

+12
Nov 27 '10 at 22:19

I did some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:

  • Insert 3,000 rows individually and get the ids from the populated objects using the Django ORM: 3200 ms
  • Insert 3,000 rows with Pandas DataFrame.to_sql() without getting ids: 774 ms
  • Insert 3,000 rows with the Django manager .bulk_create(Model(**df.to_records())) without getting ids: 574 ms
  • Insert 3,000 rows with to_csv to a StringIO buffer and COPY ( cur.copy_from() ) without getting ids: 118 ms
  • Insert 3,000 rows with to_csv and COPY and get the ids via a simple SELECT WHERE ID > [max ID before insert] (probably not thread-safe unless COPY holds a lock on the table preventing simultaneous inserts?): 201 ms
    from io import StringIO


    def bulk_to_sql(df, columns, model_cls):
        """ Inserting 3000 takes 774ms avg """
        engine = ExcelImportProcessor._get_sqlalchemy_engine()
        df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)


    def bulk_via_csv(df, columns, model_cls):
        """ Inserting 3000 takes 118ms avg """
        engine = ExcelImportProcessor._get_sqlalchemy_engine()
        connection = engine.raw_connection()
        output = StringIO()
        # Dump the selected columns as tab-separated text and rewind the buffer.
        df[columns].to_csv(output, sep='\t', header=False, index=False)
        output.seek(0)
        cur = connection.cursor()
        # Stream the buffer straight into the table with PostgreSQL COPY.
        cur.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
        connection.commit()
        cur.close()
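For reference, a call might look like this (the DataFrame, column names, and model are hypothetical):

    # df has columns 'name' and 'score' matching the table's column names.
    bulk_via_csv(df, columns=['name', 'score'], model_cls=MyResult)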

All timings were taken on a table already containing 3,000 rows, running on OS X (i7, 16 GB, SSD), averaged over ten runs using timeit .

I get my inserted primary keys back by assigning an import batch id to each row and sorting by primary key, although I'm not 100% sure primary keys will always be assigned in the same order the rows are serialized for the COPY command - opinions on that would be appreciated either way.
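In ORM terms that lookup might look roughly like this, assuming a hypothetical import_batch field stamped on every row before the COPY:

    # Primary keys of everything in this batch, in PK order.
    inserted_ids = list(
        MyModel.objects
               .filter(import_batch=batch_id)
               .order_by('pk')
               .values_list('pk', flat=True)
    )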

+6
Jul 21 '17 at 20:18

There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/ .

It builds a single insert command with multiple value sets (INSERT INTO x (val1, val2) VALUES (1,2), (3,4), etc.). This should significantly improve performance.

It is also heavily documented, which is always a plus.
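Not the snippet itself, but a minimal hand-rolled sketch of the same idea: one multi-row INSERT executed through Django's cursor (table and column names are invented, and the backend must accept multi-row VALUES):

    from django.db import connection

    def bulk_insert_pairs(pairs):
        """pairs: list of (val1, val2) tuples inserted in a single statement."""
        placeholders = ', '.join(['(%s, %s)'] * len(pairs))
        sql = 'INSERT INTO myapp_x (val1, val2) VALUES ' + placeholders
        flat_params = [value for pair in pairs for value in pair]
        cursor = connection.cursor()
        cursor.execute(sql, flat_params)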

+5
Feb 09 '11

Alternatively, if you need something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/ . It's a simple manager I used on a project.

The other snippet was not as simple and was really focused on bulk inserts for relationships. This one is just a plain insert and uses the same INSERT query.

+3
Feb 18 '11 at 22:48


