SQL JOIN Query to return rows where we did NOT find a match in the joined table

This is more of a theory/logic question, but I have two tables: links and options. links is a table in which I add rows that represent a link between a product ID (in a separate products table) and an option. The options table holds all available options.

What I am trying to do (but struggling to work out the logic for) is join the two tables and return only the options that have no row in the links table, i.e. the options that are still available to add to the product.

Is there an SQL function that can help me here? I'm not very good at SQL yet.

3 answers

Your table design sounds great.

If this query returns the id values of the "options" related to a particular "product" ...

 SELECT k.option_id FROM links k WHERE k.product_id = 'foo' 

Then this query will get the details of all the options related to that "product":

 SELECT o.id , o.name FROM options o JOIN links k ON k.option_id = o.id WHERE k.product_id = 'foo' 

Note that we can actually move the "product_id = 'foo'" predicate from the WHERE clause to the ON clause of the JOIN for an equivalent result, for example:

 SELECT o.id , o.name FROM options o JOIN links k ON k.option_id = o.id AND k.product_id = 'foo' 

(Not that it makes any difference here, but it would make a difference if we were using an OUTER JOIN: in the WHERE clause, the predicate would negate the "outer-ness" of the join and make it equivalent to an INNER JOIN.)
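To see the difference between the ON clause and the WHERE clause with an outer join, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, and the sample rows are made up for illustration; only the table and column names come from the queries above.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    INSERT INTO options VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links VALUES ('foo', 1), ('bar', 2);
""")

# Predicate in the ON clause: every options row survives the outer join,
# with NULL in k.option_id where no 'foo' link matched.
in_on = conn.execute("""
    SELECT o.id, k.option_id
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id AND k.product_id = 'foo'
     ORDER BY o.id
""").fetchall()

# Predicate in the WHERE clause: rows where k.product_id is NULL (or not
# 'foo') are discarded, so the result matches an inner join.
in_where = conn.execute("""
    SELECT o.id, k.option_id
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id
     WHERE k.product_id = 'foo'
     ORDER BY o.id
""").fetchall()

print(in_on)     # [(1, 1), (2, None), (3, None)]
print(in_where)  # [(1, 1)]
```

The unmatched rows (the ones with None) are exactly what the WHERE version throws away.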

But none of this answers your question; it only sets the stage for answering it:

How do we get the rows from "options" that are NOT related to a particular product?

The most efficient approach is (usually) an anti-join pattern.

What that is: we get all the rows from "options", along with any matching rows from "links" (for a particular product_id, in your case). That result set will include the rows from "options" that have no matching row in "links".

The "trick" is to filter out all the rows that did have a matching row (or rows) in "links". That leaves us with only the rows that had no match.

The way we filter out those rows is with a predicate in the WHERE clause that checks whether a match was found. We do that by checking a column that we know for certain will be NOT NULL if a matching row was found, and that will be NULL if a matching row was NOT found.

Something like this:

 SELECT o.id , o.name FROM options o LEFT JOIN links k ON k.option_id = o.id AND k.product_id = 'foo' WHERE k.option_id IS NULL 

The "LEFT" keyword specifies an "outer" join operation: we get back all the rows from "options" (the table on the "left" side of the JOIN), even if no matching row is found. (A normal inner join would filter out rows that have no match.)

The "trick" is in the WHERE clause: if we did find a matching row in links, we know that the "option_id" column returned from "links" will not be NULL. It can't be NULL if it "equals" something, and we know it had to "equal" something because of the predicate in the ON clause.

So we know that the rows from options that had no match will have NULL in that column.

It takes a bit to wrap your brain around it, but the anti-join quickly becomes a familiar pattern.
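Here is a small runnable sketch of the anti-join. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, and the sample rows are invented for illustration (product 'foo' already has options 1 and 3).

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    INSERT INTO options VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links VALUES ('foo', 1), ('foo', 3), ('bar', 2);
""")

# Anti-join: keep every options row, attach any 'foo' link, then keep
# only the rows where nothing was attached (k.option_id IS NULL).
available = conn.execute("""
    SELECT o.id, o.name
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id AND k.product_id = 'foo'
     WHERE k.option_id IS NULL
     ORDER BY o.id
""").fetchall()

print(available)  # [(2, 'green')]
```

Option 2 is linked to product 'bar', but not to 'foo', so it is still reported as available for 'foo'.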


The anti-join pattern is not the only way to get the result set. There are a couple of other approaches.

One option is to use a query with a "NOT EXISTS" predicate and a correlated subquery. This is somewhat easier to understand, but it usually doesn't perform as well:

 SELECT o.id , o.name FROM options o WHERE NOT EXISTS ( SELECT 1 FROM links k WHERE k.option_id = o.id AND k.product_id = 'foo' ) 

This says: get me all the rows from the options table, but for each row, run a query and see whether a matching row exists in the links table. (It doesn't matter what is returned in the select list; we are only testing whether at least one row is returned. I use "1" in the select list to remind me that I am looking for "1 row".)

This usually won't perform as well as the anti-join, but sometimes it does run faster, especially if other predicates in the WHERE clause of the outer query filter out nearly every row and the subquery only has to run for a couple of rows. (That is, when we only have to check a few needles in the haystack. When we need to process the whole haystack, the anti-join pattern is usually faster.)
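As a quick check that NOT EXISTS and the anti-join return the same rows, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, with made-up sample rows; it says nothing about relative speed, which depends on the database and the data.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    INSERT INTO options VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links VALUES ('foo', 1), ('bar', 2);
""")

# Correlated subquery: for each options row, look for a 'foo' link.
not_exists_rows = conn.execute("""
    SELECT o.id, o.name
      FROM options o
     WHERE NOT EXISTS ( SELECT 1
                          FROM links k
                         WHERE k.option_id = o.id
                           AND k.product_id = 'foo' )
     ORDER BY o.id
""").fetchall()

# Anti-join form of the same question, for comparison.
anti_join_rows = conn.execute("""
    SELECT o.id, o.name
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id AND k.product_id = 'foo'
     WHERE k.option_id IS NULL
     ORDER BY o.id
""").fetchall()

print(not_exists_rows)                    # [(2, 'green'), (3, 'blue')]
print(not_exists_rows == anti_join_rows)  # True
```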

And the beginner query you are most likely to see is NOT IN (subquery). I am not even going to give an example of that. If you have a list of literals, then by all means, use NOT IN. But with a subquery, it is rarely the best performer, even though it does seem to be the easiest to understand.

Oh, what the hey, I'll give a demo of that too (not that I encourage you to do it this way):

 SELECT o.id , o.name FROM options o WHERE o.id NOT IN ( SELECT k.option_id FROM links k WHERE k.product_id = 'foo' AND k.option_id IS NOT NULL GROUP BY k.option_id ) 

This subquery (inside the parens) gets a list of all the option_id values related to the product.

Now, for each row in options (in the outer query), we can check the id value to see whether it is in the list returned by the subquery.

If we have a guarantee that option_id will never be NULL, we can omit the predicate that tests for "option_id IS NOT NULL". (In the more general case, when a NULL makes it into the result set, the outer query cannot tell whether o.id is in the list or not, and the query returns no rows; so I usually include the predicate even when it isn't required.) The GROUP BY is also not strictly necessary, especially if there is a unique constraint (guaranteed uniqueness) on the (product_id, option_id) tuple.
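The NULL behavior is easy to reproduce. The sketch below assumes Python's built-in sqlite3 module as a stand-in for MySQL, with invented sample rows that deliberately include a link whose option_id is NULL.

```python
import sqlite3

# Hypothetical sample data, including one link with a NULL option_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    INSERT INTO options VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links VALUES ('foo', 1), ('foo', NULL);
""")

# Without the guard, the subquery returns (1, NULL). "o.id NOT IN (1, NULL)"
# is never true, so no rows come back at all.
unguarded = conn.execute("""
    SELECT o.id, o.name
      FROM options o
     WHERE o.id NOT IN ( SELECT k.option_id
                           FROM links k
                          WHERE k.product_id = 'foo' )
     ORDER BY o.id
""").fetchall()

# With the guard, the NULL never reaches the list.
guarded = conn.execute("""
    SELECT o.id, o.name
      FROM options o
     WHERE o.id NOT IN ( SELECT k.option_id
                           FROM links k
                          WHERE k.product_id = 'foo'
                            AND k.option_id IS NOT NULL )
     ORDER BY o.id
""").fetchall()

print(unguarded)  # []
print(guarded)    # [(2, 'green'), (3, 'blue')]
```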

But, again, don't use this NOT IN (subquery) other than for testing, unless there is some compelling reason to (for example, it manages to perform better than the anti-join).

You are unlikely to notice any performance difference with small sets: the overhead of sending the statement, parsing it, generating an access plan, and returning the results dwarfs the actual execution time of the plan. It is with larger sets that differences in "execution time" become apparent.

EXPLAIN SELECT ... is a really good way to get a handle on the execution plans, to see what MySQL is really doing with your statement.

Appropriate indexes, especially covering indexes, can noticeably improve the performance of some statements.
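As a rough illustration of adding an index and looking at a plan, here is a sketch that assumes Python's built-in sqlite3 module rather than MySQL. SQLite's EXPLAIN QUERY PLAN is only a stand-in for MySQL's EXPLAIN, and its output format is different. The index name idx_links_product_option and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    INSERT INTO options VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO links VALUES ('foo', 1), ('bar', 2);

    -- Both columns the anti-join reads from links are in this index,
    -- so the lookup may be answered from the index alone.
    CREATE INDEX idx_links_product_option
        ON links (product_id, option_id);
""")

anti_join = """
    SELECT o.id, o.name
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id AND k.product_id = 'foo'
     WHERE k.option_id IS NULL
     ORDER BY o.id
"""

# Each plan row ends with a human-readable detail string.
plan = conn.execute("EXPLAIN QUERY PLAN " + anti_join).fetchall()
details = [row[-1] for row in plan]
for line in details:
    print(line)

rows = conn.execute(anti_join).fetchall()
print(rows)  # [(2, 'green'), (3, 'blue')]
```

The exact plan text depends on the SQLite version, so compare plans before and after adding an index on your own database rather than relying on this output.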


Yes, you can do a LEFT JOIN (in MySQL; other dialects have their own variations), which will include the rows in links that have NO match in options. Then test whether options.someColumn IS NULL, and you will have exactly the rows in links that had no matching row in options.


Try something along these lines.

To count:

 SELECT Links.linkId, COUNT(*) FROM Links LEFT JOIN Options ON Links.optionId = Options.optionId WHERE Options.optionId IS NULL GROUP BY Links.linkId 

To see the rows:

 SELECT Links.linkId FROM Links LEFT JOIN Options ON Links.optionId = Options.optionId WHERE Options.optionId IS NULL 