How can I improve the performance of the INSERT statement?

Josh's answer here gave me a good overview of how to insert an array of 256x64x250 values into a MySQL database. But when I actually tried my INSERT statement on my data, it turned out to be terribly slow (as in, 6 minutes for a 16 MB file).

    ny, nx, nz = np.shape(data)

    query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value)
               VALUES (%s, %s, %s, %s)"""

    for frames in range(nz):
        for rows in range(ny):
            for cols in range(nx):
                cursor.execute(query, (frames, rows, cols, data[rows, cols, frames]))

I read MySQL for Python, which explained that this is the wrong approach, since issuing 4 million separate INSERTs is very inefficient.

Now, my data consists of a lot of zeros (90% actually), so I added an IF statement so that I only insert values greater than zero, and I switched to executemany():

 query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """ values = [] for frames in range(nz): for rows in range(ny): for cols in range(nx): if data[rows,cols,frames] > 0.0: values.append((frames, rows, cols, data[rows,cols,frames])) cur.executemany(query, values) 

This miraculously brought my processing time down to about 20 seconds, of which 14 seconds are spent creating the list of values (37,000 rows) and 4 seconds on the actual insert into the database.

So now I am wondering: how can I speed up this process further? It feels like my loop is terribly inefficient and there should be a better way. If I need to insert 30 measurements per dog, this would still take 10 minutes, which seems far too long for this amount of data.

Here are two versions of my raw files: with or without headers. I would like to try LOAD DATA INFILE, but I cannot figure out how to parse the data correctly.

+6
python mysql
6 answers

If the data is a numpy array, you can try the following:

 query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """ values = [] rows, cols, frames = numpy.nonzero(data) for row, col, frame in zip(rows, cols, frames): values.append((frame, row, col, data[row,col,frame])) cur.executemany(query, values) 

or

 query = """INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES (%s, %s, %s, %s ) """ rows, cols, frames = numpy.nonzero(data) values = [(row, col, frame, val) for row, col, frame, val in zip(rows, cols, frames, data[rows,cols,frames])] cur.executemany(query, values) 

Hope this helps

+5

The fastest way to insert 4 million rows (16 MB of data) is to use LOAD DATA INFILE - http://dev.mysql.com/doc/refman/5.0/en/load-data.html

So, if possible, generate a CSV file and then use LOAD DATA INFILE.
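For the array in the question, a rough sketch of that route might look like this (it assumes the `data` array, the `data` table, and the MySQLdb-style `cursor`/`conn` from the question; LOAD DATA LOCAL requires local_infile to be enabled on both client and server, and the temporary-file handling is only illustrative):

    import csv
    import tempfile

    import numpy as np

    # Write only the non-zero cells to a temporary CSV file.
    rows, cols, frames = np.nonzero(data)
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        for r, c, fr in zip(rows, cols, frames):
            writer.writerow((fr, r, c, data[r, c, fr]))
        csv_path = f.name

    # Bulk-load the CSV; LOCAL sends the file from the client to the server
    # (the connection must be opened with local_infile enabled).
    load_sql = (
        "LOAD DATA LOCAL INFILE '{path}' INTO TABLE `data` "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
        "(frame, sensor_row, sensor_col, value)"
    ).format(path=csv_path.replace("\\", "/"))

    cursor.execute(load_sql)
    conn.commit()

Writing only the non-zero cells keeps the file small, and the single bulk load avoids the per-row round trips entirely.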

hope this helps :)

EDIT

So, I took one of your source data files (rolloff.dat) and wrote a quick and dirty program to convert it into the following CSV format.

Download frames.dat from here: http://rapidshare.com/files/454896698/frames.dat

Frames.dat

    patient_name, sample_date dd/mm/yyyy, frame_time (ms), frame 0..248, row 0..255, col 0..62, value
    "Krulle (opnieuw) Krupp",04/03/2010,0.00,0,5,39,0.4
    "Krulle (opnieuw) Krupp",04/03/2010,0.00,0,5,40,0.4
    ...
    "Krulle (opnieuw) Krupp",04/03/2010,0.00,0,10,42,0.4
    "Krulle (opnieuw) Krupp",04/03/2010,0.00,0,10,43,0.4
    "Krulle (opnieuw) Krupp",04/03/2010,7.94,1,4,40,0.4
    "Krulle (opnieuw) Krupp",04/03/2010,7.94,1,5,39,0.4
    "Krulle (opnieuw) Krupp",04/03/2010,7.94,1,5,40,0.7
    "Krulle (opnieuw) Krupp",04/03/2010,7.94,1,6,44,0.7
    "Krulle (opnieuw) Krupp",04/03/2010,7.94,1,6,45,0.4
    ...
    "Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,10,0.4
    "Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,11,0.4
    "Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,12,1.1
    "Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,13,1.4
    "Krulle (opnieuw) Krupp",04/03/2010,1968.25,248,241,14,0.4

The file contains data only for the frame/row/col combinations that have a value, so zeros are excluded. 24,799 data rows were created from your source file.

Then I created a temporary (staging) table into which the frames.dat file is loaded. This is a temporary table that lets you manipulate/transform the data before loading it into the proper production/reporting tables.

    drop table if exists sample_temp;

    create table sample_temp
    (
      patient_name varchar(255) not null,
      sample_date date,
      frame_time decimal(6,2) not null default 0,
      frame_id tinyint unsigned not null,
      row_id tinyint unsigned not null,
      col_id tinyint unsigned not null,
      value decimal(4,1) not null default 0,
      primary key (frame_id, row_id, col_id)
    )
    engine=innodb;

All that remains is to load the data (note: I am on Windows, so you will need to edit this script to make it Linux-compatible - check the path names and change '\r\n' to '\n').

    truncate table sample_temp;

    start transaction;

    load data infile 'c:\\import\\frames.dat'
    into table sample_temp
    fields terminated by ',' optionally enclosed by '"'
    lines terminated by '\r\n'
    ignore 1 lines
    (
      patient_name,
      @sample_date,
      frame_time,
      frame_id,
      row_id,
      col_id,
      value
    )
    set sample_date = str_to_date(@sample_date,'%d/%m/%Y');

    commit;

    Query OK, 24799 rows affected (1.87 sec)
    Records: 24799  Deleted: 0  Skipped: 0  Warnings: 0

24K rows were loaded in 1.87 seconds.

Hope this helps :)

+5

I do not use Python or MySQL, but bulk-insert performance can often be improved with transactions.
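For example, a minimal sketch using the MySQLdb-style `conn`, `cur`, `query` and `values` from the question (the exact autocommit call varies by driver):

    # Make sure each execute() is not committed on its own, then commit the
    # whole batch once; roll back if anything fails.
    conn.autocommit(False)   # MySQLdb-style call; other drivers spell this differently
    try:
        cur.executemany(query, values)
        conn.commit()        # a single commit for all rows
    except Exception:
        conn.rollback()
        raise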

+1

If I understand this correctly, executemany() executes an INSERT INTO query for each row that you want to insert. This can be improved by creating a single INSERT query with all the values, which would look like this:

    INSERT INTO data (frame, sensor_row, sensor_col, value) VALUES
      (1, 1, 1, 1),
      (2, 2, 2, 2),
      (3, 3, 3, 3),
      ...

Your Python code should generate the parenthesized value groups and build one query string from them, so that the query is executed only once.
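A hedged sketch of how that could look in Python, reusing `cur`, `conn` and the `values` list from the question; the batch size is an arbitrary assumption chosen to keep each statement below max_allowed_packet:

    batch_size = 1000  # assumed value; tune so each statement stays under max_allowed_packet

    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        # One "(%s, %s, %s, %s)" placeholder group per row in this chunk.
        placeholders = ", ".join(["(%s, %s, %s, %s)"] * len(chunk))
        sql = ("INSERT INTO `data` (frame, sensor_row, sensor_col, value) VALUES "
               + placeholders)
        # Flatten the 4-tuples so the parameters line up with the placeholders.
        params = [item for row in chunk for item in row]
        cur.execute(sql, params)

    conn.commit()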

0

Inserting multiple rows per statement is one way to optimize. However, why three nested loops? Some data transformation might be useful instead.

Another option is to disable indexes during the insert, if you are sure that you will not have duplicate data (provided you actually have indexes on the table). Indexes have to be updated for every statement and are also checked to prevent duplicates.

Call ALTER TABLE tablename DISABLE KEYS before starting your inserts, call ALTER TABLE tablename ENABLE KEYS afterwards, and see if it helps.

From the manual:

ALTER TABLE ... DISABLE KEYS tells MySQL to stop updating nonunique indexes. ALTER TABLE ... ENABLE KEYS should then be used to re-create the missing indexes. MySQL does this with a special algorithm that is much faster than inserting keys one by one, so disabling keys before performing bulk insert operations should give a considerable speedup. Using ALTER TABLE ... DISABLE KEYS requires the INDEX privilege in addition to the privileges mentioned earlier.
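A small sketch of how that could be wired around the insert from the question (note that DISABLE KEYS only affects nonunique indexes on MyISAM tables; InnoDB mostly ignores it):

    # Suspend nonunique-index maintenance, bulk insert, then rebuild the indexes.
    cur.execute("ALTER TABLE `data` DISABLE KEYS")
    try:
        cur.executemany(query, values)
        conn.commit()
    finally:
        # Rebuilds the disabled indexes in one pass, which is usually faster
        # than updating them row by row during the insert.
        cur.execute("ALTER TABLE `data` ENABLE KEYS")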

0

Instead of the nested for loops, you can use a list comprehension:
    values = [(frames, rows, cols, data[rows, cols, frames])
              for frames in range(nz)
              for rows in range(ny)
              for cols in range(nx)
              if data[rows, cols, frames] > 0.0]

I would estimate that this could give you a small speedup, perhaps 10-20%.

-1
