I ran some tests with Django 1.10 / PostgreSQL 9.4 / Pandas 0.19.0 and got the following timings:
- Insert 3,000 rows individually and get the identifiers back from the populated objects, using the Django ORM: 3,200 ms
- Insert 3,000 rows with Pandas `DataFrame.to_sql()`, without getting IDs back: 774 ms
- Insert 3,000 rows with the Django manager's `.bulk_create(Model(**df.to_records()))`, without getting IDs back (sketched after the code below): 574 ms
- Insert 3,000 rows with `to_csv` into a `StringIO` buffer and `COPY` (`cur.copy_from()`), without getting IDs back: 118 ms
- Insert 3,000 rows with `to_csv` and `COPY`, then get the identifiers back with a simple `SELECT WHERE id > [max ID before insert]` (probably not thread-safe unless `COPY` takes a lock on the table that prevents simultaneous inserts; see the sketch after the code below): 201 ms
```python
from io import StringIO  # cStringIO.StringIO on Python 2


def bulk_to_sql(df, columns, model_cls):
    """Inserting 3,000 rows takes 774 ms on average."""
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine,
                       if_exists='append', index=False)


def bulk_via_csv(df, columns, model_cls):
    """Inserting 3,000 rows takes 118 ms on average."""
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    # Serialise the frame to a tab-separated in-memory buffer,
    # then stream it straight into Postgres with COPY.
    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    cursor = connection.cursor()
    cursor.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cursor.close()
```
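For completeness, here are hedged sketches of the two remaining approaches from the list above. `MyModel` is a hypothetical model whose field names match the DataFrame columns, and neither function is from the original benchmark, so treat them as illustrations of the technique rather than the measured code:

```python
from io import StringIO

from myapp.models import MyModel  # hypothetical model matching the df columns


def bulk_via_orm(df, columns):
    """The ~574 ms variant: bulk_create, without reading IDs back here."""
    rows = df[columns].to_dict('records')  # one plain dict per row
    MyModel.objects.bulk_create(MyModel(**row) for row in rows)


def copy_then_select_ids(df, columns, model_cls, connection):
    """The ~201 ms variant: COPY, then fetch the new IDs with
    SELECT ... WHERE id > [max ID before insert].
    Assumes the default auto 'id' primary key; as noted above, a
    concurrent insert between the two queries could skew the result."""
    table = model_cls._meta.db_table
    cur = connection.cursor()
    cur.execute("SELECT COALESCE(MAX(id), 0) FROM {}".format(table))
    max_id_before = cur.fetchone()[0]

    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    cur.copy_from(output, table, null="", columns=columns)

    cur.execute("SELECT id FROM {} WHERE id > %s ORDER BY id".format(table),
                [max_id_before])
    new_ids = [row[0] for row in cur.fetchall()]
    connection.commit()
    cur.close()
    return new_ids
```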
All timings were taken against a table that already contained 3,000 rows, running on OS X (i7, 16 GB, SSD), and are averages over ten runs using `timeit`.
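A minimal sketch of how such an average might be collected; the harness below is an assumption, not the author's actual benchmark code:

```python
import timeit

# Average of ten runs, converted to milliseconds.
# df, columns and MyModel are assumed to be in scope.
avg_ms = timeit.timeit(lambda: bulk_via_csv(df, columns, MyModel),
                       number=10) / 10 * 1000
print('bulk_via_csv: {:.0f} ms avg'.format(avg_ms))
```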
In practice I get my inserted primary keys back by assigning a batch identifier to the import and sorting by primary key, although I'm not 100% sure the primary keys will always be assigned in the order in which the rows are serialized for the COPY. I'd be grateful for opinions either way.
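A hedged sketch of that batch-identifier approach; the `import_batch` field and `MyModel` are illustrative assumptions, not the original code:

```python
import uuid
from io import StringIO

from myapp.models import MyModel  # hypothetical model with an import_batch field


def copy_with_batch_id(df, columns, connection):
    """Tag every row with a batch UUID before COPY, then read the PKs back."""
    batch_id = str(uuid.uuid4())
    df = df.assign(import_batch=batch_id)
    cols = list(columns) + ['import_batch']

    output = StringIO()
    df[cols].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    cur = connection.cursor()
    cur.copy_from(output, MyModel._meta.db_table, null="", columns=cols)
    connection.commit()
    cur.close()

    # Sorting by pk assumes ids were assigned in serialisation order,
    # which, as noted above, is not guaranteed.
    return list(MyModel.objects.filter(import_batch=batch_id)
                .order_by('pk').values_list('pk', flat=True))
```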
Chris · Jul 21 '17 at 20:18