Confusing performance improvement with nested select * ... not exists

The two tables I am querying each have ~150 million rows.

I killed the following statement after it failed to return within 45 minutes, so I do not know how long it would have run:

select * from Cats cat where not exists( select dog.foo,dog.bar from Dogs dog where cat.foo = dog.foo and cat.bar = dog.bar); 

However, this query finishes in about 3 minutes:

 select * from Cats outside where not exists(select * from Cats cat where exists( select dog.foo,dog.bar from Dogs dog where cat.foo = dog.foo and cat.bar = dog.bar)); 

My question is: what is going on behind the scenes that produces this performance improvement?

My reasoning for why both queries return the same result set:

The first (slow) query returns all Cats elements that have no match in Dogs.

The second (fast) query returns all elements that do not exist in the subset of Cats that do have a match.


I expect the following query:

 select dog.foo,dog.bar from Dogs dog where cat.foo = dog.foo and cat.bar = dog.bar 

to return [A, B, C]

This is common to both queries.

My Cats table contains the following: [A, B, C, D, E]

I expect the following query:

  select * from Cats cat where exists 

to return [A, B, C] and the last fragment:

 select * from Cats outside where not exists 

to return [D, E]

UPDATE

Set notation to prove my claims mathematically (please correct me if I used the wrong symbols):

 ∀ Cat (Ǝ cat ≠ Ǝ dog) 

For all elements in Cat, return a collection containing each cat element that is not equal to the element in dog

 ∀ Cat (Ǝ cat = Ǝ dog) 

For all elements in Cat, return a collection containing each cat element that is equal to the element in dog

 ∀ Cat (Ǝ innerCat ≠ Ǝ cat) 

For all elements in Cat, return a set containing each element of the inner cat that is not equal to the element in cat

Second update

I see that my math did not match my SQL.

+4
5 answers

Through testing, I found that this is the most efficient way to run the query from the original question:

 Select cat.foo,cat.bar from cats cat MINUS Select dog.foo,dog.bar from dogs dog 

This works because none of my columns matter.
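
As a rough illustration of the set-difference approach, here is a minimal sketch using Python's built-in sqlite3 module. The tiny tables and their values are made up for the example, and SQLite spells the operator EXCEPT where Oracle spells it MINUS. One thing the sketch makes visible: a set operator returns distinct rows, so it only matches the NOT EXISTS form when Cats has no duplicate (foo, bar) pairs.

```python
import sqlite3

# Hypothetical miniature tables; the duplicate ('D', 'D') row is deliberate.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cats (foo TEXT, bar TEXT);
    CREATE TABLE Dogs (foo TEXT, bar TEXT);
    INSERT INTO Cats VALUES ('A','A'),('B','B'),('C','C'),('D','D'),('D','D'),('E','E');
    INSERT INTO Dogs VALUES ('A','A'),('B','B'),('C','C');
""")

# Set difference: EXCEPT in SQLite, MINUS in Oracle.
except_rows = sorted(conn.execute(
    "SELECT cat.foo, cat.bar FROM Cats cat "
    "EXCEPT SELECT dog.foo, dog.bar FROM Dogs dog"
).fetchall())

# The NOT EXISTS form from the question, for comparison.
not_exists_rows = sorted(conn.execute(
    "SELECT cat.foo, cat.bar FROM Cats cat WHERE NOT EXISTS ("
    "  SELECT 1 FROM Dogs dog"
    "  WHERE cat.foo = dog.foo AND cat.bar = dog.bar)"
).fetchall())

print(except_rows)      # duplicates collapsed
print(not_exists_rows)  # duplicates preserved
```

On data without duplicate pairs the two result lists are identical.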

0

NOT IN and NOT EXISTS seem to be hard for database engines to optimize. Technically, they are called anti-joins (as distinct from equi-joins, semi-joins, non-equijoins, etc.).

When a join is difficult to optimize, engines fall back on nested loop joins. These are usually the worst-performing kind of join (although in SQL Server execution plans they often look the same, because SQL Server labels an index lookup as a "nested loop" in the execution plan).

What is the difference between these two queries? The first has only a NOT EXISTS, so it is probably doing something inefficient. The second runs an EXISTS in the innermost subquery. That part is optimized first, essentially as a join. If the keys have indexes, all is well. SQL Server can also choose hash or merge algorithms.

The NOT EXISTS in the second version is against the same table. This may give SQL Server more room for optimization.

Finally, the second version may significantly reduce the data set. If so, even a nested loop join on the outside can run much faster.

+3

The second query is much cheaper to execute, and here is why:

You alias the outer query's Cats table as outside, but you never reference outside inside your where not exists clause. Therefore, SQL can do the following:

  • find one cat where cat.foo = dog.foo and cat.bar = dog.bar (from your innermost query)
  • this means a cat exists that falsifies your outer where not exists for every cat in outside
  • so the where not exists clause is false for all rows in outside
  • therefore the query result is empty

Your first query has to re-execute the nested query for each cat in the table and is therefore slower.
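
The reasoning above can be checked on a toy data set. This is only a sketch using Python's built-in sqlite3 module; the five-row Cats table and three-row Dogs table are invented to mirror the [A, B, C, D, E] example from the question, and a real engine with 150 million rows may plan things differently.

```python
import sqlite3

# Hypothetical miniature of the tables in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cats (foo TEXT, bar TEXT);
    CREATE TABLE Dogs (foo TEXT, bar TEXT);
    INSERT INTO Cats VALUES ('A','A'),('B','B'),('C','C'),('D','D'),('E','E');
    INSERT INTO Dogs VALUES ('A','A'),('B','B'),('C','C');
""")

# First query: the subquery is correlated with the outer row.
first_query = """
    SELECT cat.foo FROM Cats cat
    WHERE NOT EXISTS (SELECT 1 FROM Dogs dog
                      WHERE cat.foo = dog.foo AND cat.bar = dog.bar)
    ORDER BY cat.foo
"""

# Second query: the NOT EXISTS subquery never mentions "outside",
# so it has the same truth value for every outer row.
second_query = """
    SELECT outside.foo FROM Cats outside
    WHERE NOT EXISTS (SELECT * FROM Cats cat
                      WHERE EXISTS (SELECT 1 FROM Dogs dog
                                    WHERE cat.foo = dog.foo
                                      AND cat.bar = dog.bar))
    ORDER BY outside.foo
"""

first_rows = [row[0] for row in conn.execute(first_query)]
second_rows = [row[0] for row in conn.execute(second_query)]
print(first_rows)   # ['D', 'E']
print(second_rows)  # []
```

Because at least one cat matches a dog, the uncorrelated NOT EXISTS is false everywhere and the second query returns nothing, so the two queries are not equivalent.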

+2

The answer to your question is to check the execution plans.

As a side note, you should try this equivalent query (see also fooobar.com/questions/88131 / ... ):
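
How a plan is obtained depends on the engine. As a small, engine-specific sketch, SQLite exposes it through EXPLAIN QUERY PLAN, shown here via Python's built-in sqlite3 module. The empty tables and the dogs_foo_bar index are assumptions made for the example, and the plan text varies between SQLite versions, so no particular output is claimed. Other databases have their own facilities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cats (foo TEXT, bar TEXT);
    CREATE TABLE Dogs (foo TEXT, bar TEXT);
    CREATE INDEX dogs_foo_bar ON Dogs (foo, bar);  -- hypothetical index
""")

slow_query = """
    SELECT * FROM Cats cat
    WHERE NOT EXISTS (SELECT dog.foo, dog.bar FROM Dogs dog
                      WHERE cat.foo = dog.foo AND cat.bar = dog.bar)
"""

# Each plan row ends with a human-readable description of one step.
plan_rows = conn.execute("EXPLAIN QUERY PLAN " + slow_query).fetchall()
for row in plan_rows:
    print(row[-1])
```

Comparing the plan of the slow query with the plan of the fast one shows where the engine scans, searches an index, or re-runs a subquery.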

 SELECT * FROM Cats cat LEFT OUTER JOIN Dogs dog ON cat.foo = dog.foo and cat.bar = dog.bar WHERE dog.foo IS NULL and dog.bar IS NULL 

I bet it will run faster (if you have the necessary indexes).
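
To see that the outer-join form selects the same cats as the NOT EXISTS form, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are invented for illustration, and the sketch says nothing about relative speed, which depends on the engine and the indexes.

```python
import sqlite3

# Hypothetical sample data mirroring the [A, B, C, D, E] example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cats (foo TEXT, bar TEXT);
    CREATE TABLE Dogs (foo TEXT, bar TEXT);
    INSERT INTO Cats VALUES ('A','A'),('B','B'),('C','C'),('D','D'),('E','E');
    INSERT INTO Dogs VALUES ('A','A'),('B','B'),('C','C');
""")

# Anti-join written as LEFT OUTER JOIN plus an IS NULL filter.
left_join_rows = sorted(conn.execute("""
    SELECT cat.foo, cat.bar
    FROM Cats cat
    LEFT OUTER JOIN Dogs dog
      ON cat.foo = dog.foo AND cat.bar = dog.bar
    WHERE dog.foo IS NULL AND dog.bar IS NULL
""").fetchall())

# The same anti-join written with NOT EXISTS.
not_exists_rows = sorted(conn.execute("""
    SELECT cat.foo, cat.bar
    FROM Cats cat
    WHERE NOT EXISTS (SELECT 1 FROM Dogs dog
                      WHERE cat.foo = dog.foo AND cat.bar = dog.bar)
""").fetchall())

print(left_join_rows)
print(not_exists_rows)
```

Unmatched cats come back from the join with NULLs in the dog columns, which is what the WHERE clause filters on.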

+1

They are different queries with different results. For the second to return the same rows as the first, it would have to be something like ...

  select * from cats outside where not exists(select * from Cats cat where exists( select dog.foo,dog.bar from Dogs dog where cat.foo = dog.foo and cat.bar = dog.bar) and outside.foo = cat.foo and outside.bar=cat.bar ) 
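
A quick check of the correlated version, again only a sketch with Python's built-in sqlite3 module and invented sample rows: once the inner Cats subquery is tied back to outside, the nested form returns the same cats as the plain NOT EXISTS query.

```python
import sqlite3

# Hypothetical sample data mirroring the [A, B, C, D, E] example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cats (foo TEXT, bar TEXT);
    CREATE TABLE Dogs (foo TEXT, bar TEXT);
    INSERT INTO Cats VALUES ('A','A'),('B','B'),('C','C'),('D','D'),('E','E');
    INSERT INTO Dogs VALUES ('A','A'),('B','B'),('C','C');
""")

# Nested form with the extra correlation on outside.foo / outside.bar.
correlated_rows = [row[0] for row in conn.execute("""
    SELECT outside.foo FROM Cats outside
    WHERE NOT EXISTS (
        SELECT * FROM Cats cat
        WHERE EXISTS (SELECT dog.foo, dog.bar FROM Dogs dog
                      WHERE cat.foo = dog.foo AND cat.bar = dog.bar)
          AND outside.foo = cat.foo AND outside.bar = cat.bar)
    ORDER BY outside.foo
""")]

# The original single NOT EXISTS query, for comparison.
plain_rows = [row[0] for row in conn.execute("""
    SELECT cat.foo FROM Cats cat
    WHERE NOT EXISTS (SELECT 1 FROM Dogs dog
                      WHERE cat.foo = dog.foo AND cat.bar = dog.bar)
    ORDER BY cat.foo
""")]

print(correlated_rows)  # ['D', 'E']
print(plain_rows)       # ['D', 'E']
```
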
-1
