General rules for simplifying SQL statements

I am looking for some “inference rules” (similar to given rules of work or logical rules) that I can use to reduce the SQL query in terms of complexity or size. Is there something similar? Any papers, any tools? What equivalents did you find yourself? This is somewhat similar to query optimization, but not in terms of performance.

To indicate something else: having a (complex) query with JOINs, SUBSELECTs, UNIONs, it is possible (or not) to reduce it to a simpler equivalent SQL statement that produces the same result using some conversion rules

So, I'm looking for equivalent transformations of SQL statements, for example, the fact that most SUBSELECTs can be rewritten as JOINs.

+62
complexity-theory sql logic reduction
Jul 01 '09 at 14:14
source share
8 answers

To indicate something else: having a (complex) query with JOINs, SUBSELECTs, UNIONs, it is possible (or not) to reduce it to a simpler equivalent SQL statement that produces the same result using some conversion rules

What optimizers do for a living (not what I say they always do it well).

Since SQL is a set-based language, there are usually several ways to convert one query to another.

Like this query:

 SELECT * FROM mytable WHERE col1 > @value1 OR col2 < @value2 

can convert to this:

 SELECT * FROM mytable WHERE col1 > @value1 UNION SELECT * FROM mytable WHERE col2 < @value2 

or that:

 SELECT mo.* FROM ( SELECT id FROM mytable WHERE col1 > @value1 UNION SELECT id FROM mytable WHERE col2 < @value2 ) mi JOIN mytable mo ON mo.id = mi.id 

which look uglier but can give better execution plans.

One of the most common things is replacing this request:

 SELECT * FROM mytable WHERE col IN ( SELECT othercol FROM othertable ) 

with this:

 SELECT * FROM mytable mo WHERE EXISTS ( SELECT NULL FROM othertable o WHERE o.othercol = mo.col ) 

In some RDBMS (for example, PostgreSQL ), DISTINCT and GROUP BY use different execution plans, so sometimes it is better to replace one with another:

 SELECT mo.grouper, ( SELECT SUM(col) FROM mytable mi WHERE mi.grouper = mo.grouper ) FROM ( SELECT DISTINCT grouper FROM mytable ) mo 

against.

 SELECT mo.grouper, SUM(col) FROM mytable GROUP BY mo.grouper 

In PostgreSQL , DISTINCT sorted and GROUP BY hashes.

MySQL missing a FULL OUTER JOIN , so it can be rewritten as follows:

 SELECT t1.col1, t2.col2 FROM table1 t1 LEFT OUTER JOIN table2 t2 ON t1.id = t2.id 

against.

 SELECT t1.col1, t2.col2 FROM table1 t1 LEFT JOIN table2 t2 ON t1.id = t2.id UNION ALL SELECT NULL, t2.col2 FROM table1 t1 RIGHT JOIN table2 t2 ON t1.id = t2.id WHERE t1.id IS NULL 

but see this blog post on how to do this more efficiently in MySQL :

This hierarchical query in Oracle :

 SELECT DISTINCT(animal_id) AS animal_id FROM animal START WITH animal_id = :id CONNECT BY PRIOR animal_id IN (father, mother) ORDER BY animal_id 

can convert to this:

 SELECT DISTINCT(animal_id) AS animal_id FROM ( SELECT 0 AS gender, animal_id, father AS parent FROM animal UNION ALL SELECT 1, animal_id, mother FROM animal ) START WITH animal_id = :id CONNECT BY parent = PRIOR animal_id ORDER BY animal_id 

the latter of which is more effective.

See this article on his blog for details of the implementation plan:

To find all ranges that overlap a given range, you can use the following query:

 SELECT * FROM ranges WHERE end_date >= @start AND start_date <= @end 

but in SQL Server this more complex query gives the same results faster:

 SELECT * FROM ranges WHERE (start_date > @start AND start_date <= @end) OR (@start BETWEEN start_date AND end_date) 

and believe it or not, I also have an article on my blog:

SQL Server also lacks an efficient way to create aggregate aggregates, so this query:

 SELECT mi.id, SUM(mo.value) AS running_sum FROM mytable mi JOIN mytable mo ON mo.id <= mi.id GROUP BY mi.id 

You can rewrite cursors more efficiently with, help me, cursors (you heard me right: cursors , more efficiently and SQL Server in one sentence).

Check out this blog post on how to do this:

There is a certain type of request that is commonly found in financial applications that are looking for an effective rate for a currency, for example, in Oracle :

 SELECT TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999') FROM t_transaction x JOIN t_rate r ON (rte_currency, rte_date) IN ( SELECT xac_currency, MAX(rte_date) FROM t_rate WHERE rte_currency = xac_currency AND rte_date <= xac_date ) 

This query can be heavily rewritten to use an equality condition that allows HASH JOIN instead of NESTED LOOPS :

 WITH v_rate AS ( SELECT cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate FROM ( SELECT cur_id, dte_date, ( SELECT MAX(rte_date) FROM t_rate ri WHERE rte_currency = cur_id AND rte_date <= dte_date ) AS rte_effdate FROM ( SELECT ( SELECT MAX(rte_date) FROM t_rate ) - level + 1 AS dte_date FROM dual CONNECT BY level <= ( SELECT MAX(rte_date) - MIN(rte_date) FROM t_rate ) ) v_date, ( SELECT 1 AS cur_id FROM dual UNION ALL SELECT 2 AS cur_id FROM dual ) v_currency ) v_eff LEFT JOIN t_rate ON rte_currency = cur_id AND rte_date = rte_effdate ) SELECT TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999') FROM ( SELECT xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt FROM t_transaction x GROUP BY xac_currency, TRUNC(xac_date) ) JOIN v_rate ON eff_currency = xac_currency AND eff_date = xac_date 

Despite being cumbersome, the last request is 6 times faster.

The main idea here is to replace <= with = , which requires building a calendar table in memory. up to JOIN s.

+58
Jul 01 '09 at 14:17
source share

Here's some of working with Oracle 8 and 9 (of course, sometimes doing the opposite can make the query easier or faster):

Brackets can be removed if they are not used to override operator precedence. A simple example is that all the logical operators in your where clause are the same: where ((a or b) or c) equivalent to where a or b or c .

A subprocess often (if not always) can be combined with the main request to simplify it. In my experience, this often significantly improves performance:

 select foo.a, bar.a from foomatic foo, bartastic bar where foo.id = bar.id and bar.id = ( select ban.id from bantabulous ban where ban.bandana = 42 ) ; 

equivalently

 select foo.a, bar.a from foomatic foo, bartastic bar, bantabulous ban where foo.id = bar.id and bar.id = ban.id and ban.bandana = 42 ; 

Using ANSI connections separates a lot of the "code monkey" logic from the really interesting parts of the where clause: the previous request is equivalent

 select foo.a, bar.a from foomatic foo join bartastic bar on bar.id = foo.id join bantabulous ban on ban.id = bar.id where ban.bandana = 42 ; 

If you want to check for the presence of a row, do not use count (*), instead use either rownum = 1 , or put the query in the where exists clause to extract only one row instead of all.

+9
Jul 01 '09 at 15:03
source share
  • I believe it is obvious to look for any cursors that can be replaced with the SQL "Set" operation.
  • Next on my list, find any correlated subqueries that can be rewritten as an uncorrelated query
  • In long stored procedures, deploy individual SQL statements into your own stored procedures. This way they get their own query caching plan.
  • Look at transactions that can shorten the scope. I regularly find statements inside a transaction that can safely be outside.
  • Subsamples can often be rewritten as a direct connection (modern optimizers are well versed in simple ones).

As @Quassnoi noted, the optimizer often does a good job. One way to help him is to update indexes and statistics and find suitable indexes for the workload of your query.

+5
Jul 01 '09 at 14:19
source share

I like replacing all the subtasks with a connection request.

It is obvious:

 SELECT * FROM mytable mo WHERE EXISTS ( SELECT * FROM othertable o WHERE o.othercol = mo.col ) 

by

 SELECT mo.* FROM mytable mo inner join othertable o on o.othercol = mo.col 

And this is under evaluation:

 SELECT * FROM mytable mo WHERE NOT EXISTS ( SELECT * FROM othertable o WHERE o.othercol = mo.col ) 

by

 SELECT mo.* FROM mytable mo left outer join othertable o on o.othercol = mo.col WHERE o.othercol is null 

This can help the DBMS choose a good execution plan in a large query.

+5
Jul 01 '09 at 14:33
source share

I like everyone in the team to follow a number of standards to make the code readable, maintainable, understandable, washable, etc. :)

  • everyone uses the same alias
  • no cursors. no cycles
  • why even think about IN when you can EXISTS
  • INDENT
  • Coding Style Consistency

there are a few more things here. What are some of your most useful database standards?

+5
Jul 01 '09 at 14:53
source share

Given the nature of SQL, you absolutely must be aware of the performance implications of any refactoring. Refactoring SQL applications is a good refactoring resource with a lot of focus on performance (see Chapter 5).

+4
Jul 01 '09 at 15:58
source share

Although simplification may not coincide with optimization, simplification may be important when writing readable SQL code, which in turn is critical to validating your SQL code for conceptual correctness (rather than syntactic correctness, which your development environment should check for you), It seems to me that in an ideal world we would write the simplest, most readable SQL code, and then the optimizer would rewrite that SQL code, which will be in any form (possibly more detailed), will work faster.

I found that thinking of SQL statements based on a given logic is very useful, especially if I need to combine sentences or figure out the complex negation of the where clause. In this case, I use the laws of Boolean algebra .

The most important for simplifying the where clause are probably DeMorgan laws (note that "·" is "AND" and "+" is "OR"):

  • NOT (xy) = NOT x + NOT y
  • NOT (x + y) = NOT xNOT y

This means that SQL:

 NOT (expr1 AND expr2) -> NOT expr1 OR NOT expr2 NOT (expr1 OR expr2) -> NOT expr1 AND NOT expr2 

These laws can be very useful to simplify where where with many nested parts of AND and OR .

It is also useful to remember that the operator field1 IN (value1, value2, ...) equivalent to field1 = value1 OR field1 = value2 OR ... This allows you to undo IN () one of two ways:

 NOT field1 IN (value1, value2) -- for longer lists NOT field1 = value1 AND NOT field1 = value2 -- for shorter lists 

Similarly, a subquery may be considered. For example, this is a negation of where where:

 NOT (table1.field1 = value1 AND EXISTS (SELECT * FROM table2 WHERE table1.field1 = table2.field2)) 

can be rewritten as:

 NOT table1.field1 = value1 OR NOT EXISTS (SELECT * FROM table2 WHERE table1.field1 = table2.field2)) 

These laws do not tell you how to convert a SQL query using a subquery to one using a join, but logical logic can help you understand the types of connections and what your query should return. For example, with tables A and B , INNER JOIN is similar to A AND B , a LEFT OUTER JOIN is similar to (A AND NOT B) OR (A AND B) , which is simplified to A OR (A AND B) , and FULL OUTER JOIN - A OR (A AND B) OR B , which simplifies to A OR B

+3
Mar 13 '12 at 15:52
source share

My approach is to study relational theory in general and relational algebra in particular. Then learn how to define the constructs used in SQL to implement operators from relational algebra (e.g. universal aka quantization) and calculus (e.g. existential quantification). The end result is that SQL has functions not found in the relational model, for example. The nulls that are probably best reorganized. Recommended reading: SQL and relational theory: how to write accurate SQL code using CJ Date .

In this vein, I'm not sure that “the fact that most SUBSELECTs can be rewritten as JOINs” is a simplification.

Take this query for example:

 SELECT c FROM T1 WHERE c NOT IN ( SELECT c FROM T2 ); 

Rewrite with JOIN

 SELECT DISTINCT T1.c FROM T1 NATURAL LEFT OUTER JOIN T2 WHERE T2.c IS NULL; 

The connection is more detailed!

Alternatively, recognize that the construction implements an antiion on projection c , for example. pseudo algrbra

 T1 { c } antijoin T2 { c } 

Simplification using relational operators:

 SELECT c FROM T1 EXCEPT SELECT c FROM T2; 
0
Mar 13 '12 at 16:33
source share



All Articles