Speed up bulk insertion with the Django ORM?

I plan to load a billion records from ~750 files (each ~250 MB) into a database using the Django ORM. Currently it takes ~20 minutes to process each file, and I was wondering if there is any way to speed up this process.

I took the following measures:

  • Used @transaction.commit_manually and commit once every 5,000 records (a rough sketch of this setup appears after this list)
  • Set DEBUG = False so that Django does not accumulate all the SQL commands in memory
  • The loop that processes the records from a single file is contained entirely in one function (to minimize stack changes)
  • Refrained from hitting the database for queries (used a local cache of objects already in the database instead of get_or_create )
  • Set force_insert=True in save() in the hope of saving Django some logic
  • Explicitly set the id in the hope of saving Django some more logic
  • General code minimization and optimization
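For reference, here is a stripped-down sketch of what my per-file loop looks like with the old (pre-Django 1.6) commit_manually API mentioned above; the model, field, and helper names are purely illustrative, not the real code:

    from django.db import transaction

    @transaction.commit_manually
    def load_file(path, seen):
        """seen: local cache mapping keys to ids, kept to avoid get_or_create round-trips."""
        for i, line in enumerate(open(path), start=1):
            key, value = line.rstrip('\n').split('\t')
            if key in seen:
                continue
            row = MyRecord(id=allocate_id(), key=key, value=value)  # MyRecord / allocate_id are hypothetical
            row.save(force_insert=True)   # skip the SELECT that decides between INSERT and UPDATE
            seen[key] = row.id
            if i % 5000 == 0:
                transaction.commit()      # commit once every 5000 records
        transaction.commit()              # flush whatever is left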

What else can I do to speed up the process? Here are some of my thoughts:

  • Use some faster Python compiler or runtime (Psyco?)
  • Bypass the ORM and use SQL directly
  • Use some third-party code that might be better (1, 2)
  • Ask the Django community to add a bulk_insert function

Any pointers regarding these items or any other ideas would be welcome :)

+43
optimization django orm bulkinsert
Nov 27 '10
7 answers

This does not apply specifically to the Django ORM, but recently I had to bulk insert >60 million rows of 8 columns of data from over 2000 files into a sqlite3 database. And I learned that the following three things reduced insert time from 48 hours to ~1 hour:

  • increase your database's cache size so it uses more RAM (the defaults are always very small; I used 3 GB); in SQLite this is done with PRAGMA cache_size = n_of_pages;

  • write the journal to RAM instead of disk (this causes a minor problem if the system crashes, but something I consider negligible given that you already have the source data on disk); in SQLite this is done with PRAGMA journal_mode = MEMORY

  • last and perhaps most important: do not create an index while inserting. That also means not declaring UNIQUE or other constraints that cause the DB to build an index. Create the indexes only after you finish inserting.

As mentioned in another answer, you should also use cursor.executemany() (or the conn.executemany() shortcut). To use it, do:

 cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data) 

Here iterable_data can be a list or something similar, or even an open file reader.
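Put together, a minimal sketch of this recipe with the standard sqlite3 module might look like the following (the table layout, file format, and list_of_files are assumptions for illustration):

    import sqlite3

    conn = sqlite3.connect('data.db')
    conn.execute('PRAGMA cache_size = 3000000')   # in pages; pick a value that amounts to a few GB of RAM
    conn.execute('PRAGMA journal_mode = MEMORY')  # keep the rollback journal in RAM instead of on disk
    conn.execute('CREATE TABLE IF NOT EXISTS mytable (field1, field2, field3)')  # no indexes or UNIQUE yet

    def rows(path):
        """Yield one (field1, field2, field3) tuple per line of a tab-separated file."""
        with open(path) as f:
            for line in f:
                yield tuple(line.rstrip('\n').split('\t'))

    for path in list_of_files:
        conn.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', rows(path))
        conn.commit()

    # Build the index only after all inserts are done.
    conn.execute('CREATE INDEX idx_mytable_field1 ON mytable (field1)')
    conn.commit()
    conn.close()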

+16
Sep 25 '12 at 20:23

Drop down to the DB-API and use cursor.executemany() . See PEP 249 for details.
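In a Django project you can reach the DB-API cursor through django.db.connection; a minimal sketch, with a made-up table and column names:

    from django.db import connection

    def raw_bulk_insert(rows):
        """rows: iterable of (field1, field2) tuples; myapp_mytable is a hypothetical table."""
        cursor = connection.cursor()
        # Django's cursor uses %s parameter placeholders on every backend, including SQLite.
        cursor.executemany(
            'INSERT INTO myapp_mytable (field1, field2) VALUES (%s, %s)',
            rows,
        )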

+12
Nov 27 '10 at 22:19

I did some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:

  • Insert 3,000 rows individually and get the ids from the populated objects using the Django ORM: 3200 ms
  • Insert 3,000 rows with Pandas DataFrame.to_sql() without getting ids: 774 ms
  • Insert 3,000 rows with the Django manager .bulk_create(Model(**df.to_records())) without getting ids: 574 ms
  • Insert 3,000 rows with to_csv to a StringIO buffer and COPY ( cur.copy_from() ) without getting ids: 118 ms
  • Insert 3,000 rows with to_csv and COPY and get the ids via a simple SELECT WHERE ID > [max ID before insert] (probably not thread-safe unless COPY holds a lock on the table preventing simultaneous inserts?): 201 ms
    from io import StringIO


    def bulk_to_sql(df, columns, model_cls):
        """ Inserting 3000 takes 774ms avg """
        engine = ExcelImportProcessor._get_sqlalchemy_engine()
        df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)


    def bulk_via_csv(df, columns, model_cls):
        """ Inserting 3000 takes 118ms avg """
        engine = ExcelImportProcessor._get_sqlalchemy_engine()
        connection = engine.raw_connection()
        output = StringIO()
        # Dump the selected columns as tab-separated text and rewind the buffer.
        df[columns].to_csv(output, sep='\t', header=False, index=False)
        output.seek(0)
        cur = connection.cursor()
        # Stream the buffer straight into the table with PostgreSQL COPY.
        cur.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
        connection.commit()
        cur.close()
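For reference, a call might look like this (the DataFrame, column names, and model are hypothetical):

    # df has columns 'name' and 'score' matching the table's column names.
    bulk_via_csv(df, columns=['name', 'score'], model_cls=MyResult)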

All timings were taken on a table already containing 3,000 rows, running on OS X (i7, 16 GB, SSD), averaged over ten runs using timeit .

I get my inserted primary keys back by assigning an import batch id to each row and sorting by primary key, although I'm not 100% sure primary keys will always be assigned in the same order the rows are serialized for the COPY command - opinions on that would be appreciated either way.
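In ORM terms that lookup might look roughly like this, assuming a hypothetical import_batch field stamped on every row before the COPY:

    # Primary keys of everything in this batch, in PK order.
    inserted_ids = list(
        MyModel.objects
               .filter(import_batch=batch_id)
               .order_by('pk')
               .values_list('pk', flat=True)
    )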

+6
Jul 21 '17 at 20:18

There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/ .

It builds a single insert command with multiple value sets (INSERT INTO x (val1, val2) VALUES (1,2), (3,4), etc.). This should significantly improve performance.

It is also heavily documented, which is always a plus.
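Not the snippet itself, but a minimal hand-rolled sketch of the same idea: one multi-row INSERT executed through Django's cursor (table and column names are invented, and the backend must accept multi-row VALUES):

    from django.db import connection

    def bulk_insert_pairs(pairs):
        """pairs: list of (val1, val2) tuples inserted in a single statement."""
        placeholders = ', '.join(['(%s, %s)'] * len(pairs))
        sql = 'INSERT INTO myapp_x (val1, val2) VALUES ' + placeholders
        flat_params = [value for pair in pairs for value in pair]
        cursor = connection.cursor()
        cursor.execute(sql, flat_params)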

+5
Feb 09 '11

Alternatively, if you need something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/ . It's a simple manager I used on a project.

The other snippet was not as simple and was really focused on bulk inserts for relationships. This one is just a plain insert and uses the same INSERT query.

+3
Feb 18 '11 at 22:48


