What is the performance difference in MySQL relational division (IN AND instead of IN OR)?

Question

What is the performance difference in MySQL relational division (IN AND instead of IN OR)?

Since MySQL does not have a built-in relational division operator, programmers must implement their own. There are two leading implementation examples that can be found in here.

For posterity, I have listed them below:

Using GROUP BY / HAVING
SELECT t.documentid FROM TABLE t WHERE t.termid IN (1,2,3) GROUP BY t.documentid HAVING COUNT(DISINCT t.termid) = 3 
The caveat is that you should use HAVING COUNT (DISTINCT, because duplicate termid equal to 2 for one document will be false positive. And COUNT should equal the number of termid values in the IN clause.
Using JOINs
 SELECT t.documentid FROM TABLE t JOIN TABLE x ON x.termid = t.termid AND x.termid = 1 JOIN TABLE y ON y.termid = t.termid AND y.termid = 2 JOIN TABLE z ON z.termid = t.termid AND z.termid = 3 
But it can be a pain to handle criteria that change a lot.

Of these two implementation methods that will offer the best performance?

+1

database mysql relational-database database-design

true Jun 28 '15 at 15:27

source share

1 answer

Rick james · Accepted Answer · 2015-06-28T17:44:30+0000

I made some improvements to the JOIN version; see below.

I voted for the JOIN approach for speed. Here is how I defined it:

HAVING version 1

 mysql> FLUSH STATUS; mysql> SELECT city -> FROM us_vch200 -> WHERE state IN ('IL', 'MO', 'PA') -> GROUP BY city -> HAVING count(DISTINCT state) >= 3; +-------------+ | city | +-------------+ | Springfield | | Washington | +-------------+ mysql> SHOW SESSION STATUS LIKE 'Handler%'; +----------------------------+-------+ | Variable_name | Value | +----------------------------+-------+ | Handler_external_lock | 2 | | Handler_read_first | 1 | | Handler_read_key | 2 | | Handler_read_last | 1 | | Handler_read_next | 4175 | -- full index scan (etc) +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+ | 1 | SIMPLE | us_vch200 | range | state_city,city_state | city_state | 769 | NULL | 4176 | Using where; Using index for group-by (scanning) | +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+

"Extra" indicates that he decided to do GROUP BY and use INDEX(city, state) , although INDEX(state, city) may make sense.

HAVING, version 2

Switching to INDEX(state, city) gives:

 mysql> FLUSH STATUS; mysql> SELECT city -> FROM us_vch200 IGNORE INDEX(city_state) -> WHERE state IN ('IL', 'MO', 'PA') -> GROUP BY city -> HAVING count(DISTINCT state) >= 3; +-------------+ | city | +-------------+ | Springfield | | Washington | +-------------+ mysql> SHOW SESSION STATUS LIKE 'Handler%'; +----------------------------+-------+ | Variable_name | Value | +----------------------------+-------+ | Handler_commit | 1 | | Handler_external_lock | 2 | | Handler_read_key | 401 | | Handler_read_next | 398 | | Handler_read_rnd | 398 | (etc) +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+ | 1 | SIMPLE | us_vch200 | range | state_city,city_state | state_city | 2 | NULL | 397 | Using where; Using index; Using filesort | +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+

Join

 mysql> SELECT x.city -> FROM us_vch200 x -> JOIN us_vch200 y ON y.city= x.city AND y.state = 'MO' -> JOIN us_vch200 z ON z.city= x.city AND z.state = 'PA' -> WHERE x.state = 'IL'; +-------------+ | city | +-------------+ | Springfield | | Washington | +-------------+ 2 rows in set (0.00 sec) mysql> SHOW SESSION STATUS LIKE 'Handler%'; +----------------------------+-------+ | Variable_name | Value | +----------------------------+-------+ | Handler_commit | 1 | | Handler_external_lock | 6 | | Handler_read_key | 86 | | Handler_read_next | 87 | (etc) +----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+ | 1 | SIMPLE | y | ref | state_city,city_state | state_city | 2 | const | 81 | Using where; Using index | | 1 | SIMPLE | z | ref | state_city,city_state | state_city | 769 | const,world.y.city | 1 | Using where; Using index | | 1 | SIMPLE | x | ref | state_city,city_state | state_city | 769 | const,world.y.city | 1 | Using where; Using index | +----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+

Only INDEX(state, city) required. The number of handlers is the smallest for this formulation, so I think it is the fastest.

Notice how the optimizer himself decided which table to start with, probably due to

 +-------+----------+ | state | COUNT(*) | +-------+----------+ | IL | 221 | | MO | 81 | -- smallest | PA | 96 | +-------+----------+

conclusions

JOIN (without unnecessary table t ) is probably the fastest. Plus this composite index is needed: INDEX(state, city) .

To translate back to your use case:

 city --> documentid state --> termid

Caution: YMMV, because the distribution of values for documentid and termid can be very different from the test case used.

What is the performance difference in MySQL relational division (IN AND instead of IN OR)?

Using GROUP BY / HAVING

Using JOINs

More articles: