Updating millions of records with inline subqueries: optimization techniques

I am looking for some tips on how best to optimize this query.

For each _piece_detail entry that:

  • has at least one matching _scan entry (matched on zip, zip_4, zip_delivery_point, serial_number)
  • belongs to a company owned by mailing_groups (via a relationship chain)
  • has either:
    • a first_scan_date_time greater than the MIN(scan_date_time) of the related _scan records, or
    • a latest_scan_date_time less than the MAX(scan_date_time) of the related _scan records

I need to:

  • Set _piece_detail.first_scan_date_time to MIN(_scan.scan_date_time)
  • Set _piece_detail.latest_scan_date_time to MAX(_scan.scan_date_time)
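
Stripped of the ownership joins, the shape of that update looks like this. This is a minimal runnable sketch in Python/SQLite, not the real system: the match key is simplified to just zip and serial_number (standing in for the full zip, zip_4, zip_delivery_point, serial_number key), and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE _piece_detail (
        job_id INTEGER, piece_id INTEGER, zip TEXT, serial_number TEXT,
        first_scan_date_time TEXT, latest_scan_date_time TEXT);
    CREATE TABLE _scan (zip TEXT, serial_number TEXT, scan_date_time TEXT);
    INSERT INTO _piece_detail VALUES (1, 1, '12345', 'A', NULL, NULL);
    INSERT INTO _scan VALUES ('12345', 'A', '2014-01-05'),
                             ('12345', 'A', '2014-01-02');
""")

# Set first/latest scan times to the MIN/MAX of the matching _scan rows,
# touching only pieces that have at least one matching scan.
conn.execute("""
    UPDATE _piece_detail
    SET first_scan_date_time = (SELECT MIN(scan_date_time) FROM _scan
                                WHERE zip = _piece_detail.zip
                                  AND serial_number = _piece_detail.serial_number),
        latest_scan_date_time = (SELECT MAX(scan_date_time) FROM _scan
                                 WHERE zip = _piece_detail.zip
                                   AND serial_number = _piece_detail.serial_number)
    WHERE EXISTS (SELECT 1 FROM _scan
                  WHERE zip = _piece_detail.zip
                    AND serial_number = _piece_detail.serial_number)
""")
print(conn.execute("SELECT first_scan_date_time, latest_scan_date_time "
                   "FROM _piece_detail").fetchone())
# ('2014-01-02', '2014-01-05')
```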

Since I am dealing with millions and millions of records, I am trying to reduce the set of records I actually have to examine. Here are some facts about the data:

  • The _piece_detail table is split up by job_id , so it seems to make sense to run these checks in the order of _piece_detail.job_id , _piece_detail.piece_id .
  • The _scan table currently contains more than 100,000,000 records and is partitioned on (zip, zip_4, zip_delivery_point, serial_number, scan_date_time), which is the same key used to match _scan with _piece_detail (minus scan_date_time).
  • Only about 40% of _piece_detail entries belong to a mailing_group , but we don't know which ones until we traverse the full relationship chain.
  • Only about 30% of _scan records relate to _piece_detail entries with a mailing_group .
  • Typically, there are 0 to 4 _scan entries per _piece_detail .
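
Since _piece_detail is already split up by job_id, one common way to shrink the working set is to drive the update one job_id at a time, so each transaction stays small and locks are held briefly. A hedged sketch of that driver loop (Python/SQLite with a deliberately simplified schema; the loop structure and sample data are illustrative, not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL connection
conn.executescript("""
    CREATE TABLE _piece_detail (job_id INTEGER, zip TEXT, first_scan_date_time TEXT);
    CREATE TABLE _scan (zip TEXT, scan_date_time TEXT);
    INSERT INTO _piece_detail VALUES (1, 'A', NULL), (2, 'B', NULL);
    INSERT INTO _scan VALUES ('A', '2014-01-01'), ('B', '2014-02-01');
""")

# One small transaction per job_id instead of one giant UPDATE.
job_ids = [r[0] for r in conn.execute(
    "SELECT DISTINCT job_id FROM _piece_detail ORDER BY job_id")]
for job_id in job_ids:
    conn.execute("""
        UPDATE _piece_detail
        SET first_scan_date_time = (SELECT MIN(scan_date_time) FROM _scan
                                    WHERE zip = _piece_detail.zip)
        WHERE job_id = ?
          AND EXISTS (SELECT 1 FROM _scan WHERE zip = _piece_detail.zip)
    """, (job_id,))
    conn.commit()  # commit per batch keeps lock time short
```

The end result is the same as one huge UPDATE, but each statement only touches one job's rows.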

Now, I am having a hard time finding a way to do this decently. First, I started with something like this:

    UPDATE _piece_detail
    INNER JOIN (
        SELECT _piece_detail.job_id, _piece_detail.piece_id,
               MIN(_scan.scan_date_time) AS first_scan_date_time,
               MAX(_scan.scan_date_time) AS latest_scan_date_time
        FROM _piece_detail
        INNER JOIN _container_quantity
            ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
            AND _piece_detail.job_id = _container_quantity.job_id
        INNER JOIN _container_summary
            ON _container_quantity.container_id = _container_summary.container_id
            AND _container_summary.job_id = _container_quantity.job_id
        INNER JOIN _mail_piece_unit
            ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id
            AND _container_quantity.job_id = _mail_piece_unit.job_id
        INNER JOIN _header
            ON _header.job_id = _piece_detail.job_id
        INNER JOIN mailing_groups
            ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
        INNER JOIN _scan
            ON _scan.zip = _piece_detail.zip
            AND _scan.zip_4 = _piece_detail.zip_4
            AND _scan.zip_delivery_point = _piece_detail.zip_delivery_point
            AND _scan.serial_number = _piece_detail.serial_number
        GROUP BY _piece_detail.job_id, _piece_detail.piece_id,
                 _scan.zip, _scan.zip_4, _scan.zip_delivery_point, _scan.serial_number
    ) AS t1
        ON _piece_detail.job_id = t1.job_id
        AND _piece_detail.piece_id = t1.piece_id
    SET _piece_detail.first_scan_date_time = t1.first_scan_date_time,
        _piece_detail.latest_scan_date_time = t1.latest_scan_date_time
    WHERE _piece_detail.first_scan_date_time < t1.first_scan_date_time
       OR _piece_detail.latest_scan_date_time > t1.latest_scan_date_time;

I suspected this was trying to load too much into memory at once and was not using the indexes well.

Then I thought I could avoid that huge combined subquery and instead add two LEFT JOIN subqueries to get the MIN / MAX, like this:

    UPDATE _piece_detail
    INNER JOIN _container_quantity
        ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
        AND _piece_detail.job_id = _container_quantity.job_id
    INNER JOIN _container_summary
        ON _container_quantity.container_id = _container_summary.container_id
        AND _container_summary.job_id = _container_quantity.job_id
    INNER JOIN _mail_piece_unit
        ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id
        AND _container_quantity.job_id = _mail_piece_unit.job_id
    INNER JOIN _header
        ON _header.job_id = _piece_detail.job_id
    INNER JOIN mailing_groups
        ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
    LEFT JOIN _scan fs
        ON (fs.zip, fs.zip_4, fs.zip_delivery_point, fs.serial_number) = (
            SELECT zip, zip_4, zip_delivery_point, serial_number
            FROM _scan
            WHERE zip = _piece_detail.zip
              AND zip_4 = _piece_detail.zip_4
              AND zip_delivery_point = _piece_detail.zip_delivery_point
              AND serial_number = _piece_detail.serial_number
            ORDER BY scan_date_time ASC
            LIMIT 1
        )
    LEFT JOIN _scan ls
        ON (ls.zip, ls.zip_4, ls.zip_delivery_point, ls.serial_number) = (
            SELECT zip, zip_4, zip_delivery_point, serial_number
            FROM _scan
            WHERE zip = _piece_detail.zip
              AND zip_4 = _piece_detail.zip_4
              AND zip_delivery_point = _piece_detail.zip_delivery_point
              AND serial_number = _piece_detail.serial_number
            ORDER BY scan_date_time DESC
            LIMIT 1
        )
    SET _piece_detail.first_scan_date_time = fs.scan_date_time,
        _piece_detail.latest_scan_date_time = ls.scan_date_time
    WHERE _piece_detail.first_scan_date_time < fs.scan_date_time
       OR _piece_detail.latest_scan_date_time > ls.scan_date_time

Here is what EXPLAIN shows when I convert them to SELECT statements:

First query:

    id | select_type | table               | type   | possible_keys                                      | key           | key_len | ref                                                       | rows   | Extra
    1  | PRIMARY     | <derived2>          | ALL    | NULL                                               | NULL          | NULL    | NULL                                                      | 844161 | NULL
    1  | PRIMARY     | _piece_detail       | eq_ref | PRIMARY,first_scan_date_time,latest_scan_date_time | PRIMARY       | 18      | t1.job_id,t1.piece_id                                     | 1      | Using where
    2  | DERIVED     | _header             | index  | PRIMARY                                            | date_prepared | 3       | NULL                                                      | 87     | Using index; Using temporary; Using filesort
    2  | DERIVED     | _piece_detail       | ref    | PRIMARY,cqt_database_id,zip                        | PRIMARY       | 10      | odms._header.job_id                                       | 9703   | NULL
    2  | DERIVED     | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity     | unique        | 14      | odms._header.job_id,odms._piece_detail.cqt_database_id    | 1      | NULL
    2  | DERIVED     | _mail_piece_unit    | eq_ref | PRIMARY,company,job_id_mail_piece_unit             | PRIMARY       | 14      | odms._container_quantity.mpu_id,odms._header.job_id       | 1      | Using where
    2  | DERIVED     | mailing_groups      | eq_ref | PRIMARY                                            | PRIMARY       | 27      | odms._mail_piece_unit.mpu_company                         | 1      | Using index
    2  | DERIVED     | _container_summary  | eq_ref | unique,container_id,job_id_container_summary       | unique        | 14      | odms._header.job_id,odms._container_quantity.container_id | 1      | Using index
    2  | DERIVED     | _scan               | ref    | PRIMARY                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using index

Second query:

    id | select_type        | table               | type   | possible_keys                                                      | key           | key_len | ref                                                       | rows      | Extra
    1  | PRIMARY            | _header             | index  | PRIMARY                                                            | date_prepared | 3       | NULL                                                      | 87        | Using index
    1  | PRIMARY            | _piece_detail       | ref    | PRIMARY,cqt_database_id,first_scan_date_time,latest_scan_date_time | PRIMARY       | 10      | odms._header.job_id                                       | 9703      | NULL
    1  | PRIMARY            | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity                     | unique        | 14      | odms._header.job_id,odms._piece_detail.cqt_database_id    | 1         | NULL
    1  | PRIMARY            | _mail_piece_unit    | eq_ref | PRIMARY,company,job_id_mail_piece_unit                             | PRIMARY       | 14      | odms._container_quantity.mpu_id,odms._header.job_id       | 1         | Using where
    1  | PRIMARY            | mailing_groups      | eq_ref | PRIMARY                                                            | PRIMARY       | 27      | odms._mail_piece_unit.mpu_company                         | 1         | Using index
    1  | PRIMARY            | _container_summary  | eq_ref | unique,container_id,job_id_container_summary                       | unique        | 14      | odms._header.job_id,odms._container_quantity.container_id | 1         | Using index
    1  | PRIMARY            | fs                  | index  | NULL                                                               | updated       | 1       | NULL                                                      | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop)
    1  | PRIMARY            | ls                  | index  | NULL                                                               | updated       | 1       | NULL                                                      | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop)
    3  | DEPENDENT SUBQUERY | _scan               | ref    | PRIMARY                                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using where; Using index; Using filesort
    2  | DEPENDENT SUBQUERY | _scan               | ref    | PRIMARY                                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using where; Using index; Using filesort

Now, looking at the EXPLAIN output for each, I really can't say which gives me the best bang for the buck. The first shows fewer total rows when multiplying out the rows column, but the second seems to execute a bit faster.

Is there something I could do to achieve the same results with better performance by restructuring the query?

optimization sql mysql

4 answers

Disable index updates when performing bulk updates

    ALTER TABLE _piece_detail DISABLE KEYS;
    UPDATE ....;
    ALTER TABLE _piece_detail ENABLE KEYS;

Refer to the MySQL docs: http://dev.mysql.com/doc/refman/5.0/en/alter-table.html

EDIT: Looking at the MySQL docs I pointed to, I see that they describe this for MyISAM tables and say nothing definite about other table types. Further solutions here: How to disable the index in innodb
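
For engines where DISABLE KEYS has no effect (such as InnoDB), the closest equivalent of this advice is: drop the secondary indexes on the updated columns, run the bulk update, then rebuild the indexes once at the end. A sketch of that pattern (Python/SQLite purely for illustration; the table, index, and column names here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE _piece_detail (piece_id INTEGER, first_scan_date_time TEXT);
    CREATE INDEX idx_first_scan ON _piece_detail(first_scan_date_time);
    INSERT INTO _piece_detail VALUES (1, NULL), (2, NULL);
""")

# Drop the secondary index so the bulk update does not maintain it row by row...
conn.execute("DROP INDEX idx_first_scan")
conn.execute("UPDATE _piece_detail SET first_scan_date_time = '2014-01-01'")
# ...then rebuild it once, a single sorted pass instead of millions of updates.
conn.execute("CREATE INDEX idx_first_scan ON _piece_detail(first_scan_date_time)")
```

Whether the drop/rebuild pays off depends on how large a fraction of the table the update touches.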


There is a principle I was taught and still strictly follow today: create as many temporary tables as you like, while avoiding the use of views. This is especially true for UPDATE / DELETE / INSERT, since:

  • You cannot put an index on a view
  • Derived tables may not be held in memory if the result set is large
  • The table (MyISAM) / rows (InnoDB) can stay locked longer, since the derived query runs every time. I prefer a temporary table with a primary key to join to the parent table.

And most importantly, it keeps your code maintainable and understandable.

My approach will be

    CREATE TEMPORARY TABLE xxx (...);
    INSERT INTO xxx SELECT q FROM y INNER JOIN z ....;
    UPDATE _piece_detail INNER JOIN xxx ON (...) SET ...;

Always reduce downtime!
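
The temporary-table approach above can be sketched end to end like this (Python/SQLite for a runnable illustration; the scan_agg table name and the simplified single-column key are assumptions, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE _piece_detail (zip TEXT, first_scan_date_time TEXT,
                                latest_scan_date_time TEXT);
    CREATE TABLE _scan (zip TEXT, scan_date_time TEXT);
    INSERT INTO _piece_detail VALUES ('A', NULL, NULL);
    INSERT INTO _scan VALUES ('A', '2014-01-01'), ('A', '2014-03-01');
""")

# Step 1: materialize the aggregates once into a small, keyed temp table.
conn.executescript("""
    CREATE TEMP TABLE scan_agg AS
        SELECT zip,
               MIN(scan_date_time) AS first_scan,
               MAX(scan_date_time) AS latest_scan
        FROM _scan GROUP BY zip;
    CREATE INDEX temp.idx_scan_agg ON scan_agg(zip);
""")

# Step 2: update from the pre-aggregated table instead of a derived query.
conn.execute("""
    UPDATE _piece_detail
    SET first_scan_date_time = (SELECT first_scan FROM scan_agg
                                WHERE zip = _piece_detail.zip),
        latest_scan_date_time = (SELECT latest_scan FROM scan_agg
                                 WHERE zip = _piece_detail.zip)
    WHERE zip IN (SELECT zip FROM scan_agg)
""")
```

The point is that the expensive aggregation runs exactly once into a small indexed table, and the UPDATE then joins against that key instead of re-running a derived query.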


Why don't you use subqueries for each join, including the inner joins?

    INNER JOIN (SELECT field1, field2, field3
                FROM _container_quantity
                ORDER BY 1, 2, 3)
        ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
        AND _piece_detail.job_id = _container_quantity.job_id
    INNER JOIN (SELECT field1, field2, field3
                FROM _container_summary
                ORDER BY 1, 2, 3)
        ON _container_quantity.container_id = _container_summary.container_id
        AND _container_summary.job_id = _container_quantity.job_id

Of course, you are pulling a lot into memory without limiting the selects in those inner joins. By putting ORDER BY 1,2,3 at the end of each subquery, you give each subquery a sorted result to join on. And your only index is on the headers, yet you are not joining on _header....

A few suggestions for optimizing this query: either create the indexes you need on each table, or use subqueries in the join clauses to manually build the sorted sets you need on the fly.

Also remember that when you LEFT JOIN against a temporary table full of aggregates, you are asking for performance problems.

Contains at least one matching _scan entry (zip, zip_4, zip_delivery_point, serial_number)

Umm... this is the very first bullet of what you want to do, and none of those fields are indexed?


From your EXPLAIN results, it seems the subqueries go through all the rows twice; what about saving the MIN/MAX from a first pass and using only one LEFT JOIN instead of two?
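
In other words, one aggregate query can return both extremes in a single pass over the matching rows, rather than two ORDER BY ... LIMIT 1 subqueries each walking the same range. A small illustration (Python/SQLite, simplified single-column key and invented sample rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE _scan (zip TEXT, scan_date_time TEXT);
    INSERT INTO _scan VALUES ('A', '2014-01-01'), ('A', '2014-02-01'),
                             ('A', '2014-03-01');
""")

# MIN and MAX come back from one scan of the matching rows.
row = conn.execute(
    "SELECT MIN(scan_date_time), MAX(scan_date_time) FROM _scan WHERE zip = 'A'"
).fetchone()
print(row)  # ('2014-01-01', '2014-03-01')
```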

