Speeding up a SELECT where a matching condition exists in another table, without duplicates

If I have the following two tables:

  • Table "a" with two columns: id (int) [Primary index], column1 [Indexed]
  • Table "b" with 3 columns: id_table_a (int), condition1 (int), condition2 (int) [all columns as primary pointer]

I can run the following query to select rows from table a for which table b has a row with condition1 = 1:

 SELECT a.id
 FROM a
 WHERE EXISTS (SELECT 1 FROM b WHERE b.id_table_a=a.id && condition1=1 LIMIT 1)
 ORDER BY a.column1
 LIMIT 50

With two hundred million rows in both tables, this query is very slow. If I do this:

 SELECT a.id
 FROM a
 INNER JOIN b ON a.id=b.id_table_a && b.condition1=1
 ORDER BY a.column1
 LIMIT 50

This is pretty fast, but if table b has several rows matching the same id_table_a, duplicates are returned. If I use SELECT DISTINCT or GROUP BY a.id to remove the duplicates (see the variants below), the query becomes very slow again.
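
For reference, the de-duplicated variants look roughly like this (reconstructed from the description above):

 SELECT DISTINCT a.id
 FROM a
 INNER JOIN b ON a.id=b.id_table_a && b.condition1=1
 ORDER BY a.column1
 LIMIT 50

 SELECT a.id
 FROM a
 INNER JOIN b ON a.id=b.id_table_a && b.condition1=1
 GROUP BY a.id
 ORDER BY a.column1
 LIMIT 50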

Here is an SQLFiddle showing sample queries: http://sqlfiddle.com/#!9/35eb9e/10

Is there a way to make a fast, duplicate-free join in this case?

* Edited to show that INNER instead of LEFT join doesn't really matter

* Edited to show that moving the condition into the join makes no real difference.

* Edited to add LIMIT

* Edited to add ORDER BY

5 answers

You can try an INNER JOIN with DISTINCT:

 SELECT distinct a.id FROM a INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1 

but if you are selecting more columns than just the id (for example SELECT *), make sure you do not apply the DISTINCT to the id alone, since that would return an incorrect result in this case; instead list the columns:

 SELECT distinct col1, col2, col3 .... FROM a INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1 

You can also add a composite index that includes condition1, e.g. KEY (id_table_a, condition1) on table b.
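
For example (the index name here is arbitrary, and note that b's composite primary key already starts with these two columns, so the benefit depends on your setup):

 ALTER TABLE b ADD INDEX idx_b_id_cond1 (id_table_a, condition1);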

If you can, also run

  ANALYZE TABLE table_name; 

on both tables.

Another thing to try is to reverse the table order in the join:

 SELECT distinct a.id FROM b INNER JOIN a ON a.id=b.id_table_a AND b.condition1=1 

Testing this against your sample tables, both forms use the index in the same way: http://sqlfiddle.com/#!9/35eb9e/15 (the latter plan just adds "Using where").

 # USING DISTINCT TO REMOVE DUPLICATES, without the extra column and ORDER BY
 EXPLAIN SELECT DISTINCT a.id
 FROM a
 INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1;

Looks like I found the answer.

 SELECT a.id
 FROM a
 INNER JOIN b ON b.id_table_a=a.id
   && b.condition1=1
   && b.condition2=(SELECT b.condition2 FROM b
                    WHERE b.id_table_a=a.id && b.condition1=1
                    LIMIT 1)
 ORDER BY a.column1
 LIMIT 5;

I do not know whether there is a flaw in this; please let me know if there is. If anyone has a way to simplify it, I will gladly accept your answer.

 SELECT id FROM a INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1 

Put the condition in the ON clause of the join, so the index on table b can be used for filtering. Also use an INNER JOIN rather than a LEFT JOIN.

Then you should have fewer results to group.
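
A sketch of how that could look with the de-duplication and ordering added back (this exact combination is an assumption; the answer above only shows the filtered join):

 SELECT a.id
 FROM a
 INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1
 GROUP BY a.id          -- de-duplicates; a.column1 is functionally dependent on a's primary key a.id
 ORDER BY a.column1
 LIMIT 50;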


Wrap your fast version of the query in one that handles the de-duplication and the LIMIT:

 SELECT DISTINCT *
 FROM (
   SELECT a.id, a.column1   -- column1 must be selected here so the outer ORDER BY can use it
   FROM a
   JOIN b ON a.id = b.id_table_a && b.condition1 = 1
 ) x
 ORDER BY column1
 LIMIT 50

We know the inner query is fast. The de-duplication and ordering still have to happen, but now they happen on the smallest possible set of rows.

See the SQLFiddle.


Option 2:

Try the following:

Create indexes as follows:

 CREATE INDEX a_id_column1 ON a (id, column1);
 CREATE INDEX b_id_table_a_condition1 ON b (id_table_a, condition1);

These are covering indexes: indexes that contain all the columns the query needs, which means the query can be answered from the index alone without touching the table rows.

Then try the following:

 SELECT *
 FROM (
   SELECT a.id, MIN(a.column1) column1
   FROM a
   JOIN b ON a.id = b.id_table_a AND b.condition1 = 1
   GROUP BY a.id
 ) x
 ORDER BY column1
 LIMIT 50
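
To check that the plan really stays on the indexes, you could run EXPLAIN on the inner query and look for "Using index" in the Extra column (this check is an extra suggestion):

 EXPLAIN SELECT a.id, MIN(a.column1) column1
 FROM a
 JOIN b ON a.id = b.id_table_a AND b.condition1 = 1
 GROUP BY a.id;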

Use your fast query as a subquery and remove the duplicates in the outer SELECT:

 SELECT DISTINCT sub.id
 FROM (
   SELECT a.id
   FROM a
   INNER JOIN b ON a.id=b.id_table_a && b.condition1=1
   WHERE b.id_table_a > :offset
   ORDER BY a.column1
   LIMIT 50
 ) sub

Because duplicates are removed, you can get fewer than 50 rows. Just repeat the query until you have enough rows. Start with :offset = 0 and use the last id from the previous result as :offset in the following queries.
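
A sketch of the repetition (the value 123456 is made up purely for illustration):

 -- first page
 SELECT DISTINCT sub.id
 FROM (
   SELECT a.id
   FROM a
   INNER JOIN b ON a.id=b.id_table_a && b.condition1=1
   WHERE b.id_table_a > 0          -- :offset = 0
   ORDER BY a.column1
   LIMIT 50
 ) sub;

 -- next page, assuming the last id returned above was 123456
 SELECT DISTINCT sub.id
 FROM (
   SELECT a.id
   FROM a
   INNER JOIN b ON a.id=b.id_table_a && b.condition1=1
   WHERE b.id_table_a > 123456     -- :offset = last id from the previous result
   ORDER BY a.column1
   LIMIT 50
 ) sub;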

If you know your data statistics, you can also use two limits. The limit in the inner query just has to be high enough that it yields 50 distinct rows with a probability that is high enough for you.

 SELECT DISTINCT sub.id
 FROM (
   SELECT a.id
   FROM a
   INNER JOIN b ON a.id=b.id_table_a && b.condition1=1
   ORDER BY a.column1
   LIMIT 1000
 ) sub
 LIMIT 50

For example: if you have an average of 10 duplicates per id, LIMIT 1000 in the inner query will return about 100 distinct rows on average, and it is very unlikely that you will get fewer than 50.

If the condition2 column is boolean, you know there can be at most two duplicates per id; in that case a LIMIT 100 in the inner query will suffice.

