I am looking for some tips on how best to optimize this query.
For each _piece_detail record that:
- Has at least one matching _scan record (matched on zip, zip_4, zip_delivery_point, serial_number)
- Belongs to a company in mailing_groups (via a chain of relationships)
- Has either:
  - a first_scan_date_time that is greater than the MIN(scan_date_time) of the related _scan records, or
  - a latest_scan_date_time that is less than the MAX(scan_date_time) of the related _scan records

I need to:
- Set _piece_detail.first_scan_date_time to MIN(_scan.scan_date_time)
- Set _piece_detail.latest_scan_date_time to MAX(_scan.scan_date_time)
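To make the intended semantics concrete, here is a minimal runnable sketch on a toy SQLite schema. It is only an illustration of the update logic: the match key is collapsed to serial_number alone, and the mailing_groups relationship chain is omitted.

```python
import sqlite3

# Toy stand-ins for the real tables; the real match key is the composite
# (zip, zip_4, zip_delivery_point, serial_number).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE _piece_detail (
    piece_id INTEGER PRIMARY KEY,
    serial_number TEXT,
    first_scan_date_time TEXT,
    latest_scan_date_time TEXT
);
CREATE TABLE _scan (
    serial_number TEXT,
    scan_date_time TEXT
);
INSERT INTO _piece_detail VALUES (1, 'A', NULL, NULL);
INSERT INTO _scan VALUES ('A', '2014-01-02'), ('A', '2014-01-05');
""")

# Correlated-subquery update: only pieces with at least one matching scan
# get their first/latest scan times set from MIN/MAX of the related scans.
cur.execute("""
UPDATE _piece_detail
SET first_scan_date_time  = (SELECT MIN(scan_date_time) FROM _scan
                             WHERE _scan.serial_number = _piece_detail.serial_number),
    latest_scan_date_time = (SELECT MAX(scan_date_time) FROM _scan
                             WHERE _scan.serial_number = _piece_detail.serial_number)
WHERE EXISTS (SELECT 1 FROM _scan
              WHERE _scan.serial_number = _piece_detail.serial_number)
""")
conn.commit()

print(cur.execute(
    "SELECT first_scan_date_time, latest_scan_date_time FROM _piece_detail"
).fetchone())  # → ('2014-01-02', '2014-01-05')
```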
Since I am dealing with millions and millions of records, I am trying to cut down the set of records I actually have to examine. Here are some facts about the data:
- The _piece_detail table is split up by job_id, so it seems to make sense to run these checks in chunks of _piece_detail.job_id, _piece_detail.piece_id.
- The _scan table currently holds over 100,000,000 records and is keyed on (zip, zip_4, zip_delivery_point, serial_number, scan_date_time), which is the same key used to match a _scan with a _piece_detail (minus scan_date_time).
- Only about 40% of _piece_detail records belong to a mailing_group, but we don't know which ones until we follow the full relationship chain.
- Only about 30% of _scan records relate to a _piece_detail that has a mailing_group.
- There are typically 0 to 4 _scan records per _piece_detail.
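Since the table is already split up by job_id, the obvious way to bound the working set is to run one UPDATE per job rather than one giant statement. A runnable sketch of that chunking, again on a toy SQLite schema with the match key reduced to serial_number (assumptions: a DB-API connection, per-chunk commits):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE _piece_detail (job_id INTEGER, piece_id INTEGER,
    serial_number TEXT, first_scan_date_time TEXT, latest_scan_date_time TEXT);
CREATE TABLE _scan (serial_number TEXT, scan_date_time TEXT);
INSERT INTO _piece_detail VALUES (1, 1, 'A', NULL, NULL), (2, 1, 'B', NULL, NULL);
INSERT INTO _scan VALUES ('A', '2014-01-02'), ('B', '2014-01-07'), ('B', '2014-01-03');
""")

# One UPDATE per job_id keeps each statement's working set small and lets
# the server release locks between chunks.
job_ids = [r[0] for r in
           cur.execute("SELECT DISTINCT job_id FROM _piece_detail ORDER BY job_id")]
for job_id in job_ids:
    cur.execute("""
        UPDATE _piece_detail
        SET first_scan_date_time  = (SELECT MIN(scan_date_time) FROM _scan
                                     WHERE _scan.serial_number = _piece_detail.serial_number),
            latest_scan_date_time = (SELECT MAX(scan_date_time) FROM _scan
                                     WHERE _scan.serial_number = _piece_detail.serial_number)
        WHERE job_id = ?
          AND EXISTS (SELECT 1 FROM _scan
                      WHERE _scan.serial_number = _piece_detail.serial_number)
    """, (job_id,))
    conn.commit()  # commit after each chunk
```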
Now, I'm having a hard time finding a way to do this efficiently. First, I started with something like this:
UPDATE _piece_detail
INNER JOIN (
    SELECT _piece_detail.job_id,
           _piece_detail.piece_id,
           MIN(_scan.scan_date_time) AS first_scan_date_time,
           MAX(_scan.scan_date_time) AS latest_scan_date_time
    FROM _piece_detail
    INNER JOIN _container_quantity
        ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
        AND _piece_detail.job_id = _container_quantity.job_id
    INNER JOIN _container_summary
        ON _container_quantity.container_id = _container_summary.container_id
        AND _container_summary.job_id = _container_quantity.job_id
    INNER JOIN _mail_piece_unit
        ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id
        AND _container_quantity.job_id = _mail_piece_unit.job_id
    INNER JOIN _header
        ON _header.job_id = _piece_detail.job_id
    INNER JOIN mailing_groups
        ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
    INNER JOIN _scan
        ON _scan.zip = _piece_detail.zip
        AND _scan.zip_4 = _piece_detail.zip_4
        AND _scan.zip_delivery_point = _piece_detail.zip_delivery_point
        AND _scan.serial_number = _piece_detail.serial_number
    GROUP BY _piece_detail.job_id,
             _piece_detail.piece_id,
             _scan.zip,
             _scan.zip_4,
             _scan.zip_delivery_point,
             _scan.serial_number
) AS t1
    ON _piece_detail.job_id = t1.job_id
    AND _piece_detail.piece_id = t1.piece_id
SET _piece_detail.first_scan_date_time = t1.first_scan_date_time,
    _piece_detail.latest_scan_date_time = t1.latest_scan_date_time
WHERE _piece_detail.first_scan_date_time < t1.first_scan_date_time
   OR _piece_detail.latest_scan_date_time > t1.latest_scan_date_time;
I suspect this tries to load too much into memory at once and may not be using the indexes properly.
Then I thought I could avoid that huge grouped subquery and instead add two LEFT JOIN subqueries to get the MIN/MAX, like this:
UPDATE _piece_detail
INNER JOIN _container_quantity
    ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
    AND _piece_detail.job_id = _container_quantity.job_id
INNER JOIN _container_summary
    ON _container_quantity.container_id = _container_summary.container_id
    AND _container_summary.job_id = _container_quantity.job_id
INNER JOIN _mail_piece_unit
    ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id
    AND _container_quantity.job_id = _mail_piece_unit.job_id
INNER JOIN _header
    ON _header.job_id = _piece_detail.job_id
INNER JOIN mailing_groups
    ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
LEFT JOIN _scan fs
    ON (fs.zip, fs.zip_4, fs.zip_delivery_point, fs.serial_number) = (
        SELECT zip, zip_4, zip_delivery_point, serial_number
        FROM _scan
        WHERE zip = _piece_detail.zip
          AND zip_4 = _piece_detail.zip_4
          AND zip_delivery_point = _piece_detail.zip_delivery_point
          AND serial_number = _piece_detail.serial_number
        ORDER BY scan_date_time ASC
        LIMIT 1
    )
LEFT JOIN _scan ls
    ON (ls.zip, ls.zip_4, ls.zip_delivery_point, ls.serial_number) = (
        SELECT zip, zip_4, zip_delivery_point, serial_number
        FROM _scan
        WHERE zip = _piece_detail.zip
          AND zip_4 = _piece_detail.zip_4
          AND zip_delivery_point = _piece_detail.zip_delivery_point
          AND serial_number = _piece_detail.serial_number
        ORDER BY scan_date_time DESC
        LIMIT 1
    )
SET _piece_detail.first_scan_date_time = fs.scan_date_time,
    _piece_detail.latest_scan_date_time = ls.scan_date_time
WHERE _piece_detail.first_scan_date_time < fs.scan_date_time
   OR _piece_detail.latest_scan_date_time > ls.scan_date_time;
Here is the EXPLAIN output for each when I convert them to SELECT statements:
First query:

| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 844161 | NULL |
| 1 | PRIMARY | _piece_detail | eq_ref | PRIMARY,first_scan_date_time,latest_scan_date_time | PRIMARY | 18 | t1.job_id,t1.piece_id | 1 | Using where |
| 2 | DERIVED | _header | index | PRIMARY | date_prepared | 3 | NULL | 87 | Using index; Using temporary; Using filesort |
| 2 | DERIVED | _piece_detail | ref | PRIMARY,cqt_database_id,zip | PRIMARY | 10 | odms._header.job_id | 9703 | NULL |
| 2 | DERIVED | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity | unique | 14 | odms._header.job_id,odms._piece_detail.cqt_database_id | 1 | NULL |
| 2 | DERIVED | _mail_piece_unit | eq_ref | PRIMARY,company,job_id_mail_piece_unit | PRIMARY | 14 | odms._container_quantity.mpu_id,odms._header.job_id | 1 | Using where |
| 2 | DERIVED | mailing_groups | eq_ref | PRIMARY | PRIMARY | 27 | odms._mail_piece_unit.mpu_company | 1 | Using index |
| 2 | DERIVED | _container_summary | eq_ref | unique,container_id,job_id_container_summary | unique | 14 | odms._header.job_id,odms._container_quantity.container_id | 1 | Using index |
| 2 | DERIVED | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using index |

Second query:

| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | PRIMARY | _header | index | PRIMARY | date_prepared | 3 | NULL | 87 | Using index |
| 1 | PRIMARY | _piece_detail | ref | PRIMARY,cqt_database_id,first_scan_date_time,latest_scan_date_time | PRIMARY | 10 | odms._header.job_id | 9703 | NULL |
| 1 | PRIMARY | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity | unique | 14 | odms._header.job_id,odms._piece_detail.cqt_database_id | 1 | NULL |
| 1 | PRIMARY | _mail_piece_unit | eq_ref | PRIMARY,company,job_id_mail_piece_unit | PRIMARY | 14 | odms._container_quantity.mpu_id,odms._header.job_id | 1 | Using where |
| 1 | PRIMARY | mailing_groups | eq_ref | PRIMARY | PRIMARY | 27 | odms._mail_piece_unit.mpu_company | 1 | Using index |
| 1 | PRIMARY | _container_summary | eq_ref | unique,container_id,job_id_container_summary | unique | 14 | odms._header.job_id,odms._container_quantity.container_id | 1 | Using index |
| 1 | PRIMARY | fs | index | NULL | updated | 1 | NULL | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
| 1 | PRIMARY | ls | index | NULL | updated | 1 | NULL | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
| 3 | DEPENDENT SUBQUERY | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using where; Using index; Using filesort |
| 2 | DEPENDENT SUBQUERY | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using where; Using index; Using filesort |
Now, looking at the EXPLAIN output for each, I really can't tell which gives me the best bang for my buck. The first shows fewer total rows when you multiply out the rows column, but the second seems to execute a bit faster.
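One restructuring I have been wondering about is staging the MIN/MAX aggregates into a scratch table keyed on the match key, then updating from that, so _scan is aggregated once instead of being probed by dependent subqueries. A minimal runnable sketch on a toy SQLite schema (match key reduced to serial_number, mailing_groups filtering omitted):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE _piece_detail (piece_id INTEGER PRIMARY KEY, serial_number TEXT,
    first_scan_date_time TEXT, latest_scan_date_time TEXT);
CREATE TABLE _scan (serial_number TEXT, scan_date_time TEXT);
INSERT INTO _piece_detail VALUES (1, 'A', NULL, NULL);
INSERT INTO _scan VALUES ('A', '2014-01-02'), ('A', '2014-01-05');

-- Stage the per-key aggregates once, in a single pass over _scan.
CREATE TEMP TABLE scan_agg AS
    SELECT serial_number,
           MIN(scan_date_time) AS first_scan,
           MAX(scan_date_time) AS latest_scan
    FROM _scan
    GROUP BY serial_number;
""")

# Update from the small staged table instead of re-probing _scan per row.
cur.execute("""
UPDATE _piece_detail
SET first_scan_date_time  = (SELECT first_scan  FROM scan_agg
                             WHERE scan_agg.serial_number = _piece_detail.serial_number),
    latest_scan_date_time = (SELECT latest_scan FROM scan_agg
                             WHERE scan_agg.serial_number = _piece_detail.serial_number)
WHERE serial_number IN (SELECT serial_number FROM scan_agg)
""")
conn.commit()
```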
Is there something I could do to achieve the same results with better performance by changing the structure of the query?