Optimize a query with a huge NOT IN expression

I am trying to find sources that exist ONLY before a certain timestamp. The query below performs very poorly for this job. Any ideas for an optimization or an index that would improve it?

    select distinct sourcesite
    from contentmeta
    where timestamp <= '2011-03-15'
      and sourcesite not in (
        select distinct sourcesite
        from contentmeta
        where timestamp > '2011-03-15'
      );

There is an index on sourcesite and timestamp, but the query still takes a long time:

    mysql> EXPLAIN select distinct sourcesite from contentmeta where timestamp <= '2011-03-15' and sourcesite not in (select distinct sourcesite from contentmeta where timestamp>'2011-03-15');
    +----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+
    | id | select_type        | table       | type           | possible_keys | key      | key_len | ref  | rows   | Extra                                           |
    +----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+
    |  1 | PRIMARY            | contentmeta | index          | NULL          | sitetime | 14      | NULL | 725697 | Using where; Using index                        |
    |  2 | DEPENDENT SUBQUERY | contentmeta | index_subquery | sitetime      | sitetime | 5       | func |     48 | Using index; Using where; Full scan on NULL key |
    +----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+
3 answers

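The subquery does not need DISTINCT, and the WHERE clause on the outer query is not required either: NOT IN already excludes every source that has a row after the cutoff, so whatever remains has only rows on or before it (assuming timestamp is never NULL).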

Try:

    select distinct sourcesite
    from contentmeta
    where sourcesite not in (
        select sourcesite
        from contentmeta
        where timestamp > '2011-03-15'
      );
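
Read another way, a source qualifies exactly when its most recent row is not later than the cutoff. Here is a minimal sketch of that reading as a single aggregate, assuming `sourcesite` and `timestamp` are never NULL. This form is not taken from the answers here, and whether it is actually faster depends on the MySQL version and the available indexes, so compare it with EXPLAIN before relying on it:

```sql
-- One grouped pass over contentmeta: keep a source only if its
-- latest row is on or before the cutoff date.
SELECT sourcesite
FROM contentmeta
GROUP BY sourcesite
HAVING MAX(timestamp) <= '2011-03-15';
```

GROUP BY already returns each source once, so no DISTINCT is needed.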

This should work:

    SELECT DISTINCT c1.sourcesite
    FROM contentmeta c1
    LEFT JOIN contentmeta c2
           ON c2.sourcesite = c1.sourcesite
          AND c2.timestamp > '2011-03-15'
    WHERE c1.timestamp <= '2011-03-15'
      AND c2.sourcesite IS NULL

For optimal performance, use a multi-column index on contentmeta (sourcesite, timestamp).

As a rule of thumb, joins perform better than subqueries in MySQL, because the derived tables that subqueries produce cannot use indexes.
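
The multi-column index recommended above could be created as sketched here. The name `idx_site_time` is made up for illustration. The EXPLAIN output in the question already shows a key called `sitetime`, which may well be this same index, so it is worth checking before adding a duplicate:

```sql
-- List the existing indexes first; `sitetime` may already cover
-- (sourcesite, timestamp).
SHOW INDEX FROM contentmeta;

-- Hypothetical index name. Column order matters: sourcesite first lets
-- the join match on the source, then range-scan timestamp within it.
CREATE INDEX idx_site_time ON contentmeta (sourcesite, timestamp);
```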


I find that NOT IN is just not very well optimized in many databases. Use a left outer join instead:

    select distinct cm.sourcesite
    from contentmeta cm
    left outer join (
        select distinct sourcesite
        from contentmeta
        where timestamp > '2011-03-15'
      ) t on cm.sourcesite = t.sourcesite
    where cm.timestamp <= '2011-03-15'
      and t.sourcesite is null

This assumes that sourcesite is never NULL.
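
The assumption matters because the two forms treat NULL differently. With NOT IN, a row whose sourcesite is NULL is not returned whenever the subquery yields any rows, and a NULL produced by the subquery makes the whole query return nothing. The join form does return rows with a NULL sourcesite, because a NULL never matches the join condition. Here is a sketch of how one might check for this and guard against it, assuming the same table and cutoff as in the question (the alias `newer` is only illustrative):

```sql
-- If this returns 0, the NOT IN form and the join form agree.
SELECT COUNT(*) FROM contentmeta WHERE sourcesite IS NULL;

-- Join form with NULL sources excluded explicitly.
SELECT DISTINCT cm.sourcesite
FROM contentmeta cm
LEFT JOIN contentmeta newer
       ON newer.sourcesite = cm.sourcesite
      AND newer.timestamp > '2011-03-15'
WHERE cm.timestamp <= '2011-03-15'
  AND cm.sourcesite IS NOT NULL
  AND newer.sourcesite IS NULL;
```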


Source: https://habr.com/ru/post/1411606/

