What is the performance difference in MySQL relational division (IN AND instead of IN OR)?

Since MySQL does not have a built-in relational division operator, programmers must implement their own. There are two leading implementation examples that can be found in here.

For posterity, I have listed them below:

Using GROUP BY / HAVING


SELECT t.documentid FROM TABLE t WHERE t.termid IN (1,2,3) GROUP BY t.documentid HAVING COUNT(DISINCT t.termid) = 3 

The caveat is that you should use HAVING COUNT (DISTINCT, because duplicate termid equal to 2 for one document will be false positive. And COUNT should equal the number of termid values ​​in the IN clause.

Using JOINs


 SELECT t.documentid FROM TABLE t JOIN TABLE x ON x.termid = t.termid AND x.termid = 1 JOIN TABLE y ON y.termid = t.termid AND y.termid = 2 JOIN TABLE z ON z.termid = t.termid AND z.termid = 3 

But it can be a pain to handle criteria that change a lot.

Of these two implementation methods that will offer the best performance?

+1
source share
1 answer

I made some improvements to the JOIN version; see below.

I voted for the JOIN approach for speed. Here is how I defined it:

HAVING version 1

 mysql> FLUSH STATUS; mysql> SELECT city -> FROM us_vch200 -> WHERE state IN ('IL', 'MO', 'PA') -> GROUP BY city -> HAVING count(DISTINCT state) >= 3; +-------------+ | city | +-------------+ | Springfield | | Washington | +-------------+ mysql> SHOW SESSION STATUS LIKE 'Handler%'; +----------------------------+-------+ | Variable_name | Value | +----------------------------+-------+ | Handler_external_lock | 2 | | Handler_read_first | 1 | | Handler_read_key | 2 | | Handler_read_last | 1 | | Handler_read_next | 4175 | -- full index scan (etc) +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+ | 1 | SIMPLE | us_vch200 | range | state_city,city_state | city_state | 769 | NULL | 4176 | Using where; Using index for group-by (scanning) | +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+--------------------------------------------------+ 

"Extra" indicates that he decided to do GROUP BY and use INDEX(city, state) , although INDEX(state, city) may make sense.

HAVING, version 2

Switching to INDEX(state, city) gives:

 mysql> FLUSH STATUS; mysql> SELECT city -> FROM us_vch200 IGNORE INDEX(city_state) -> WHERE state IN ('IL', 'MO', 'PA') -> GROUP BY city -> HAVING count(DISTINCT state) >= 3; +-------------+ | city | +-------------+ | Springfield | | Washington | +-------------+ mysql> SHOW SESSION STATUS LIKE 'Handler%'; +----------------------------+-------+ | Variable_name | Value | +----------------------------+-------+ | Handler_commit | 1 | | Handler_external_lock | 2 | | Handler_read_key | 401 | | Handler_read_next | 398 | | Handler_read_rnd | 398 | (etc) +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+ | 1 | SIMPLE | us_vch200 | range | state_city,city_state | state_city | 2 | NULL | 397 | Using where; Using index; Using filesort | +----+-------------+-----------+-------+-----------------------+------------+---------+------+------+------------------------------------------+ 

Join

 mysql> SELECT x.city -> FROM us_vch200 x -> JOIN us_vch200 y ON y.city= x.city AND y.state = 'MO' -> JOIN us_vch200 z ON z.city= x.city AND z.state = 'PA' -> WHERE x.state = 'IL'; +-------------+ | city | +-------------+ | Springfield | | Washington | +-------------+ 2 rows in set (0.00 sec) mysql> SHOW SESSION STATUS LIKE 'Handler%'; +----------------------------+-------+ | Variable_name | Value | +----------------------------+-------+ | Handler_commit | 1 | | Handler_external_lock | 6 | | Handler_read_key | 86 | | Handler_read_next | 87 | (etc) +----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+ | 1 | SIMPLE | y | ref | state_city,city_state | state_city | 2 | const | 81 | Using where; Using index | | 1 | SIMPLE | z | ref | state_city,city_state | state_city | 769 | const,world.y.city | 1 | Using where; Using index | | 1 | SIMPLE | x | ref | state_city,city_state | state_city | 769 | const,world.y.city | 1 | Using where; Using index | +----+-------------+-------+------+-----------------------+------------+---------+--------------------+------+--------------------------+ 

Only INDEX(state, city) required. The number of handlers is the smallest for this formulation, so I think it is the fastest.

Notice how the optimizer himself decided which table to start with, probably due to

 +-------+----------+ | state | COUNT(*) | +-------+----------+ | IL | 221 | | MO | 81 | -- smallest | PA | 96 | +-------+----------+ 

conclusions

JOIN (without unnecessary table t ) is probably the fastest. Plus this composite index is needed: INDEX(state, city) .

To translate back to your use case:

 city --> documentid state --> termid 

Caution: YMMV, because the distribution of values ​​for documentid and termid can be very different from the test case used.

+1
source

All Articles