A condition on a joined table is faster than the same condition on the referencing column

I have a query involving two tables: table A has many rows and contains a field called b_id that refers to a record in table B , which contains about 30 different rows. Table A has an index on b_id , and table B has an index on the name column.

My query looks something like this:

 SELECT COUNT(A.id) FROM A INNER JOIN B ON B.id = A.b_id WHERE (B.name != 'dummy') AND <condition>; 

Here <condition> is some arbitrary condition on table A (I have many such conditions, and all of them behave the same way).

This query is very slow (north of 2 seconds), and EXPLAIN shows that the query optimizer starts with table B , which matches about 29 rows, and then scans table A . Using STRAIGHT_JOIN to turn the order around made the query run instantly.

I'm not a fan of black magic, so I decided to try something else: look up the id of the record in B whose name is dummy , say 23, and then simplify the query to:

 SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>; 

To my surprise, this query was actually slower than the join version, taking north of a second.

Any ideas on why the join would be faster than the simplified query?

UPDATE: as requested in the comments, the EXPLAIN output:

With STRAIGHT_JOIN:

 +----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
 | id | select_type | table | type   | possible_keys   | key     | key_len | ref           | rows   | Extra       |
 +----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
 |  1 | SIMPLE      | A     | ALL    | b_id            | NULL    | NULL    | NULL          | 200707 | Using where |
 |  1 | SIMPLE      | B     | eq_ref | PRIMARY,id_name | PRIMARY | 4       | schema.A.b_id |      1 | Using where |
 +----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+

Without the join:

 +----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
 | id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
 +----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
 |  1 | SIMPLE      | A     | ALL  | b_id          | NULL | NULL    | NULL | 200707 | Using where |
 +----+-------------+-------+------+---------------+------+---------+------+--------+-------------+

UPDATE 2: I tried another option:

SELECT COUNT(A.id) FROM A WHERE b_id IN (<all the ids except for 23>) AND <condition>;

This runs faster than the version without the join, but still slower than the join, so it seems the inequality operation is responsible for part of the performance difference, but not all of it.

+7
sql join mysql
5 answers

If you are using MySQL 5.6 or later, you can ask the query optimizer what it is doing:

 SET optimizer_trace="enabled=on";

 ## YOUR QUERY
 SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
 ## END YOUR QUERY

 SELECT trace FROM information_schema.optimizer_trace;
 SET optimizer_trace="enabled=off";

You will almost certainly need to refer to the Tracing the Optimizer and Optimizer sections of the MySQL reference manual.


Looking at the first EXPLAIN, the query is probably faster because the optimizer can use table B to filter down to the required rows based on the join, and then use the foreign key to fetch the rows in table A .

In the EXPLAIN output, this is the interesting bit: only one row is matched, and schema.A.b_id is used. Effectively this pre-filters the rows from A , which I think is where the performance difference occurs.

 | ref           | rows | Extra       |
 | schema.A.b_id |    1 | Using where |

So, as usual with queries, it all comes down to indexes - or rather, missing indexes. Just because you have indexes on individual fields does not necessarily mean they are suitable for the query you are running.

Basic rule: if EXPLAIN does not say Using index , you need to add a suitable index.

Looking at the EXPLAIN output, the first interesting item is the Extra column on each row.

In the first example, we see that

 | 1 | SIMPLE | A | .... Using where |
 | 1 | SIMPLE | B | .... Using where |

Both of these say Using where , which is not ideal; at least one, and preferably both, should say Using index .

When you do

 SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>; 

and see Using where , then you need to add an index so that the whole table does not have to be scanned.

for example if you did

 EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23) 

You should see Using where; Using index (assuming Id is the primary key and therefore indexed).

If you added a condition to the end

 EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23) and Field > 0 

and see Using where , then you need an index on both fields. Just having an index on a single field does not mean MySQL will be able to use it for a query over several fields - that is for the query planner to decide. I am not entirely sure of the internal rules, but as a rule, adding a composite index to match the query helps a great deal.

Therefore, adding an index (on the two fields in the query above):

 ALTER TABLE `A` ADD INDEX `IndexIdField` (`Id`,`Field`) 

should change things so that a query on these two fields uses the index.

I tried this on one of my databases, which has a transactions table with a user column.

I will use this query:

 EXPLAIN SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11; 

Run without the two-field index:

 PRIMARY,user PRIMARY 4 NULL 14334 Using where 

Then add the index:

 ALTER TABLE `transactions` ADD INDEX `IndexIdUser` (`id`, `user`); 

Then the same query again, and this time:

 PRIMARY,user,IndexIdUser IndexIdUser 4 NULL 12628 Using where; Using index 

This time it uses the index - and the result is much faster.
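The same switch is easy to reproduce outside MySQL. Here is a small sketch using Python's built-in sqlite3 module (not MySQL, so this is only an analogy; the table and index names are just illustrative): SQLite's EXPLAIN QUERY PLAN shows the plan going from a full table scan to a covering-index search once the two-field index exists.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Illustrative stand-in for the transactions table above; txn_id is a plain
# column (not the rowid) so the planner's choice of index is easy to see.
con.execute("CREATE TABLE transactions (txn_id INTEGER, user INTEGER)")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(i, i % 20) for i in range(20000)])
con.commit()

query = "SELECT COUNT(*) FROM transactions WHERE txn_id < 9000 AND user != 11"

def plan():
    # EXPLAIN QUERY PLAN is SQLite's rough analogue of MySQL's EXPLAIN
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + query))

before = plan()  # no useful index: a full table scan
con.execute("CREATE INDEX IndexIdUser ON transactions (txn_id, user)")
after = plan()   # the composite index covers the whole query
print(before)
print(after)
```

With the index in place, the plan reports a search using the covering index instead of a scan, which is the same improvement the MySQL EXPLAIN above shows.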


From @Wrikken's comments - and bear in mind that I don't have the exact schema or data, so some of this investigation requires assumptions about the schema (which may be wrong):

 SELECT COUNT(A.id) FROM A FORCE INDEX (b_id) would perform at least as good as SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id. 

If we look at the first EXPLAIN in the OP, we see that there are two rows in the plan. Consulting the EXPLAIN documentation for *eq_ref*, I see that this determines the rows to consider based on that relationship.

The order of the EXPLAIN output does not necessarily mean it does one thing and then the other; it is just the plan that was chosen to execute the query (at least as far as I can tell).

For some reason, the query optimizer decided not to use the index on b_id - I assume that, because of the query, the optimizer decided a table scan would be more efficient.

The second EXPLAIN bothers me a bit because it does not use the index on b_id ; possibly because of the AND <condition> (which is omitted, so I can only guess). When I try this with an index on b_id , it uses the index; but as soon as the condition is added, it stops using the index.

So, when doing

  SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id 

This all suggests that the PRIMARY index on B is where the speed difference arises; given schema.A.b_id in the EXPLAIN, I assume there is a foreign key relationship there, which should be a better way of gathering related rows than the index on b_id . So the query optimizer can use that relation to determine which rows to choose, and because the primary index is better than a secondary index, it is much faster to select the rows from B and then use the foreign key to match the rows in A.

+4

I don’t see anything strange here. You need to understand how MySQL uses indexes. Here is an article that I usually recommend: 3 ways MySQL uses indexes .

It's always funny to watch people write things like WHERE (B.name != 'dummy') AND <condition> , because the AND <condition> may be the reason the MySQL optimizer chose a particular index, and there is no point comparing that query's performance with that of one using WHERE b_id != 23 AND <condition> , because the two queries usually require different indexes to execute well.

One thing you should understand is that MySQL loves equality comparisons and does not like ranges and inequality comparisons. It is usually better to specify the exact values than to use a range condition or a != comparison.
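As a small illustration of that preference (using Python's bundled sqlite3 rather than MySQL, so this is only an analogy; the index name and row counts are made up): a != predicate leaves the planner scanning, while an explicit IN list of the wanted values lets it search the index once per value.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE A (id INTEGER PRIMARY KEY, b_id INTEGER)")
con.execute("CREATE INDEX idx_b_id ON A (b_id)")
con.executemany("INSERT INTO A (b_id) VALUES (?)",
                [(i % 30 + 1,) for i in range(3000)])
con.commit()

def plan(sql):
    # Join EXPLAIN QUERY PLAN's detail strings into one line for inspection
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Inequality: not usable for an index search, so the planner scans
plan_neq = plan("SELECT COUNT(id) FROM A WHERE b_id != 23")

# Equality list: one index search per listed value
others = ",".join(str(i) for i in range(1, 31) if i != 23)
plan_in = plan("SELECT COUNT(id) FROM A WHERE b_id IN (%s)" % others)

print(plan_neq)
print(plan_in)
```

The first plan reports a scan, the second a search on the b_id index, mirroring what the OP observed in UPDATE 2.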

So let's compare the two queries.

With STRAIGHT_JOIN

For each row, in A.id order ( A.id is the primary key and is clustered, that is, the data is stored on disk in its order), fetch the row data from disk to check whether your <condition> and the b_id test are satisfied; then (again, for each matching row) find the corresponding row for b_id , go to disk, fetch B.name , and compare it with 'dummy'. Even though this plan is not efficient at all, you have only 200,000 rows in table A, so it still looks pretty fast.

Without STRAIGHT_JOIN

For each row in table B, compare the name; if it matches, look in the index on A.b_id (which is obviously sorted by b_id , since it is an index, and therefore holds the A ids in random order), and for each A.id under that b_id , find the corresponding row of A on disk to check the <condition> : count the id if it matches, otherwise discard the row.

As you can see, there is nothing strange in the second query taking a very long time: you are basically forcing MySQL to access almost every row of table A randomly, whereas in the first query you read table A in the order it is stored on disk.

The query without the join does not use any index at all. In fact, it should perform about the same as the STRAIGHT_JOIN query. I assume the order of b_id != 23 and <condition> is significant.

UPD1: could you compare the performance of your no-join query with the following:

 SELECT COUNT(A.id) FROM A WHERE IF(b_id!=23, <condition>, 0); 

UPD2: the fact that you do not see an index in EXPLAIN does not mean that no index is used at all. At minimum, an index is used to determine the reading order: when there is no other useful index, that is usually the primary key; but, as I said above, when there is an equality condition and a matching index, MySQL will use that index. So, basically, to understand which index is used, you can look at the order of the output rows: if the order matches the primary key order, no other index was used (that is, the primary key index was); if the row order is shuffled, some other index was involved.

In your case, the second condition seems to be true for most rows, but the index is still used; that is, to get b_id , MySQL goes to disk in random order, which is why it is slow. There is no black magic here, and the second condition does affect performance.

+2

This should probably be a comment, not an answer, but it will be a little long.

First of all, it is hard to believe that two queries that EXPLAIN (almost) identically run at different speeds. It is even less likely that the one with an extra row in its EXPLAIN is the faster one. And I think the word faster is the key.

You have compared speed (the time required to complete the query), and that is an extremely empirical way of testing. For example, you might have failed to disable the cache, which makes the comparison useless. Not to mention that your <insert your preferred software application here> could have taken a page fault or performed some other operation while the test was running, which could have slowed the query down.

The correct way to measure query performance is with EXPLAIN (that is why it exists).

So the closest thing I have to an answer to the question "any ideas on why the join would be faster than the simplified query?" is, in short: a layer 8 error.

I do, however, have some other comments that should be taken into account to speed things up. If A.id is the primary key (the name smells like it is), then according to your EXPLAIN, why should count(A.id) scan all the rows? It should be able to get the data directly from the index, but I do not see Using index in the Extra flags. It seems you do not have a unique index on it, or the field is nullable, which also smells odd. Make sure the field is NOT NULL and has a unique index, run the EXPLAIN again, confirm that the Extra flags contain Using index , and then time the query (correctly). It should run much faster.

Also note that an approach that should yield the same performance improvement as the one mentioned above is to replace count(A.id) with count(*) .

Just my 2 cents.

0

Because MySQL will not use the index for a WHERE indexed_col != val condition.

The optimizer decides whether to use an index by estimation. Since "!=" is likely to match almost everything, it skips the index to avoid the overhead of using it. (Yes, MySQL is being simplistic here; it does not make good use of statistics on the indexed column.)

You can make the SELECT faster by using indexed_col IN (everything other than val) , which MySQL knows how to resolve with the index.

There is an example here showing that the query optimizer will not use an index with an inequality test.

0

The answer to this question is actually a very simple consequence of the development of the algorithm:

  • The key difference between the two queries is the merge operation.

Before getting to the algorithms lesson, I will explain why the merge operation improves performance. The merge improves performance because it reduces the overall workload of the aggregation. This is an iteration-versus-recursion problem. In the iterative analogy, we simply iterate over the entire index and count the matches. In the recursive analogy, we divide and conquer (so to speak); in other words, we filter down the results we need to consider, thereby reducing the quantity of things we actually have to count.

Here are the key questions:

  • Why is merge sort faster than insertion sort?
  • Is merge sort always faster than insertion sort?

Let us explain this with the parable:

Let's say we have a deck of playing cards, and we need to total up the number of playing cards with the numbers 7, 8 and 9 (assuming we do not know the answer in advance).

Let's say we have two ways to solve this problem:

  • We can hold the deck in one hand and deal the cards onto the table one by one, counting as we go.
  • We can divide the cards into two groups: black suits and red suits. Then we can perform step 1 on one of the groups and reuse the result for the second group.

If we choose option 2, we split our problem in half. As a result, we can count the matching black cards and multiply the number by 2. In other words, we reuse the part of the query execution plan that did the counting. This reasoning works especially well when we know in advance how the cards were sorted (think "clustered index"). Counting half the cards is obviously much less time-consuming than counting the entire deck.
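The parable can be written out directly. Here is a minimal sketch in Python (the deck layout and suit names are just the usual 52-card assumptions, nothing from the question itself):

```python
from itertools import product

# A standard 52-card deck as (rank, suit) pairs
ranks = list(range(2, 11)) + ["J", "Q", "K", "A"]
suits = ["clubs", "spades", "hearts", "diamonds"]
deck = [(rank, suit) for rank, suit in product(ranks, suits)]

# Option 1: deal every card onto the table, counting as we go
full_count = sum(1 for rank, suit in deck if rank in (7, 8, 9))

# Option 2: count only the black half, then reuse the result,
# because the red suits mirror the black suits card for card
black = [(rank, suit) for rank, suit in deck if suit in ("clubs", "spades")]
half_count = 2 * sum(1 for rank, suit in black if rank in (7, 8, 9))

print(full_count, half_count)  # both give 12: three ranks times four suits
```

Option 2 inspects only 26 cards instead of 52 yet arrives at the same total, which is the whole point of reusing part of the plan.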

If we wanted to improve performance yet again, depending on how large our database is, we might even consider sorting into four groups (instead of two): clubs, diamonds, hearts and spades. Whether we take that next step depends on whether the overhead of sorting the cards into the additional groups is justified by the performance gain. With a small number of cards, the gain is probably not worth the extra overhead required to sort into the different groups. As the number of cards grows, the performance gain begins to outweigh the overhead.

Here is an excerpt from Introduction to Algorithms, 3rd Edition (Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein). (Note: if someone can tell me how to format subscript notation, I will edit this to improve readability.)

(Also, keep in mind that “n” is the number of objects we are dealing with.)

“As an example, in Chapter 2, we will see two algorithms for sorting. The first, known as insertion sort, takes time roughly equal to c1n² to sort n items, where c1 is a constant that does not depend on n. That is, it takes time roughly proportional to n². The second, merge sort, takes time roughly equal to c2n lg n, where lg n stands for log2 n and c2 is another constant that also does not depend on n. Insertion sort typically has a smaller constant factor than merge sort, so that c1 < c2. We shall see that the constant factors can have far less of an impact on the running time than the dependence on the input size n. Let's write insertion sort's running time as c1n · n and merge sort's running time as c2n · lg n. Then we see that where insertion sort has a factor of n in its running time, merge sort has a factor of lg n, which is much smaller. (For example, when n = 1000, lg n is about 10, and when n is one million, lg n is only about 20.) Although insertion sort usually runs faster than merge sort for small input sizes, once the input size n becomes large enough, merge sort's advantage of lg n versus n will more than compensate for the difference in constant factors. No matter how much smaller c1 is than c2, there will always be a crossover point beyond which merge sort is faster.”

Why is this relevant? If we look at the query execution plans for these two queries, we will see a merge operation caused by the inner join.

0
