Alternative to SQL subquery

I have the following query:

    SELECT DISTINCT e.id, folder, subject, in_reply_to, message_id, "references", e.updated_at,
      (
        select count(*)
        from emails
        where (
            select "references"[1] from emails where message_id = e.message_id
          ) = ANY ("references")
          or message_id = (
            select "references"[1] from emails where message_id = e.message_id
          )
      )
    FROM "emails" e
    INNER JOIN "email_participants" ON ("email_participants"."email_id" = e."id")
    WHERE (("user_id" = 220) AND ("folder" = 'INBOX'))
    ORDER BY e."updated_at" DESC
    LIMIT 10 OFFSET 0;

Here is the EXPLAIN ANALYZE output for the query above.

The query performed well until I added the count subquery below:

    (
      select count(*)
      from emails
      where (
          select "references"[1] from emails where message_id = e.message_id
        ) = ANY ("references")
        or message_id = (
          select "references"[1] from emails where message_id = e.message_id
        )
    )

In fact, I tried simpler subqueries and it seems the aggregate function itself takes time.

Is there an alternative way to add a subquery count to each result? Should I update the results after running the first query?

Here is a pastebin that creates the table and also runs the poorly performing query at the end to show what the result should be.

sql postgresql
4 answers

Expanding on Paul Guyot's answer, you can move the subquery into a derived table, which should be faster because it retrieves the message counts in one scan (plus a join), as opposed to one scan per row.

    SELECT DISTINCT e.id, e.folder, e.subject, in_reply_to, e.message_id, e."references", e.updated_at,
           t1.message_count
    FROM "emails" e
    INNER JOIN "email_participants" ON ("email_participants"."email_id" = e."id")
    INNER JOIN (
        SELECT COUNT(e2.id) message_count, e.message_id
        FROM emails e
        LEFT JOIN emails e2
          ON (ARRAY[e."references"[1]] <@ e2."references" OR e2.message_id = e."references"[1])
        GROUP BY e.message_id
    ) t1 ON t1.message_id = e.message_id
    WHERE (("user_id" = 220) AND ("folder" = 'INBOX'))
    ORDER BY e."updated_at" DESC
    LIMIT 10 OFFSET 0;

SQL Fiddle using the pastebin data - http://www.sqlfiddle.com/#!15/c6298/7

Below are the Postgres query plans for getting a count in a correlated subquery and for getting a count by joining to a derived table. I used one of my own tables, but I think the results should be similar.

Correlated subquery

    "Limit  (cost=0.00..1123641.81 rows=1000 width=8) (actual time=11.237..5395.237 rows=1000 loops=1)"
    "  ->  Seq Scan on visit v  (cost=0.00..44996236.24 rows=40045 width=8) (actual time=11.236..5395.014 rows=1000 loops=1)"
    "        SubPlan 1"
    "          ->  Aggregate  (cost=1123.61..1123.62 rows=1 width=0) (actual time=5.393..5.393 rows=1 loops=1000)"
    "                ->  Seq Scan on visit v2  (cost=0.00..1073.56 rows=20018 width=0) (actual time=0.002..4.280 rows=21393 loops=1000)"
    "                      Filter: (company_id = v.company_id)"
    "                      Rows Removed by Filter: 18653"
    "Total runtime: 5395.369 ms"

Join to a derived table

    "Limit  (cost=1173.74..1211.81 rows=1000 width=12) (actual time=21.819..22.629 rows=1000 loops=1)"
    "  ->  Hash Join  (cost=1173.74..2697.72 rows=40036 width=12) (actual time=21.817..22.465 rows=1000 loops=1)"
    "        Hash Cond: (v.company_id = visit.company_id)"
    "        ->  Seq Scan on visit v  (cost=0.00..973.45 rows=40045 width=8) (actual time=0.010..0.198 rows=1000 loops=1)"
    "        ->  Hash  (cost=1173.71..1173.71 rows=2 width=12) (actual time=21.787..21.787 rows=2 loops=1)"
    "              Buckets: 1024  Batches: 1  Memory Usage: 1kB"
    "              ->  HashAggregate  (cost=1173.67..1173.69 rows=2 width=4) (actual time=21.783..21.784 rows=3 loops=1)"
    "                    ->  Seq Scan on visit  (cost=0.00..973.45 rows=40045 width=4) (actual time=0.003..6.695 rows=40046 loops=1)"
    "Total runtime: 22.806 ms"
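
The queries behind the two plans above are not shown. As a rough sketch, they could have looked like the following; only the visit table and its company_id column appear in the plans, so the select lists, the alias names t and visit_count, and the LIMIT value are assumptions inferred from the plan output.

```sql
-- Assumed shape of the correlated version: the inner count is re-evaluated
-- for every outer row (loops=1000 in the first plan).
EXPLAIN ANALYZE
SELECT v.company_id,
       (SELECT count(*) FROM visit v2 WHERE v2.company_id = v.company_id) AS visit_count
FROM visit v
LIMIT 1000;

-- Assumed shape of the joined version: the counts are computed once per
-- company_id and then hash-joined to the outer rows.
EXPLAIN ANALYZE
SELECT v.company_id, t.visit_count
FROM visit v
INNER JOIN (
    SELECT company_id, count(*) AS visit_count
    FROM visit
    GROUP BY company_id
) t ON t.company_id = v.company_id
LIMIT 1000;
```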

From what I understand of the semantics of your query, you can simplify:

    select count(*)
    from emails
    where (
        select "references"[1] from emails where message_id = e.message_id
      ) = ANY ("references")
      or message_id = (
        select "references"[1] from emails where message_id = e.message_id
      )

to

 select count(*) from emails where e."references"[1] = ANY ("references") OR message_id = e."references"[1] 

Indeed, message_id is not necessarily unique, but if you had several rows for a given message_id value, your query would fail, because a scalar subquery that returns more than one row raises an error.
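
As a small illustration of why the original subquery already depends on message_id matching a single row, here is a standalone statement (it does not touch the emails table; the VALUES list is made up) in which a scalar subquery yields two rows:

```sql
-- A subquery used as a single value must not return more than one row.
SELECT (SELECT x FROM (VALUES (1), (2)) AS t(x));
-- ERROR:  more than one row returned by a subquery used as an expression
```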

This simplification, however, does not significantly change the cost of the query. In fact, the problem here is that you need two full scans of the emails table to complete the query (as well as an index scan on emails_message_id_index). You can save one full scan by using an index on the references array.

You would create an index like this with:

 CREATE INDEX emails_references_index ON emails USING GIN ("references"); 

The index alone greatly helps the initial query: provided statistics are up to date and the table has a sufficiently large number of rows, PostgreSQL will use index scans. However, you should modify the subquery as follows to help the planner perform a bitmap index scan on that array index:

 select count(*) from emails where ARRAY[e."references"[1]] <@ "references" OR message_id = e."references"[1] 
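
A quick way to convince yourself of the rewrite is sketched below. The first statement checks on literal arrays that, for a non-null value, the = ANY form and the <@ (is contained by) form agree. The second part assumes the GIN index has been created and that "references" is a text[] column; the message id literal is a made-up placeholder.

```sql
-- For a non-null value, both predicates mean "value is an element of the array".
SELECT 'a' = ANY (ARRAY['a', 'b'])   AS any_form,        -- true
       ARRAY['a'] <@ ARRAY['a', 'b'] AS contained_form;  -- true

-- Refresh statistics, then look at the plan for the containment form.
-- Assumption: "references" is text[]; cast the literal array to the column's
-- type if it is declared differently (for example varchar[]).
ANALYZE emails;

EXPLAIN
SELECT count(*)
FROM emails
WHERE ARRAY['<placeholder-message-id>'] <@ "references";
```

If the planner picks the index, the plan should mention a bitmap scan on emails_references_index; on a very small table it may still prefer a sequential scan.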

The final query then reads:

    SELECT DISTINCT e.id, folder, subject, in_reply_to, message_id, "references", e.updated_at,
      (
        select count(*)
        from emails
        where ARRAY[e."references"[1]] <@ "references"
           OR message_id = e."references"[1]
      )
    FROM "emails" e
    INNER JOIN "email_participants" ON ("email_participants"."email_id" = e."id")
    WHERE (("user_id" = 220) AND ("folder" = 'INBOX'))
    ORDER BY e."updated_at" DESC
    LIMIT 10 OFFSET 0;

To illustrate the expected gain, some tests were conducted in a dummy environment:

  • with 10,000 rows in the emails table (and corresponding rows in the email_participants table), the initial query runs in 787 ms, adding the index reduces this to 399 ms, and the proposed query runs in 12 ms;
  • with 100,000 rows, the initial query takes 9,200 ms, adding the index reduces this to 4,251 ms, and the proposed query runs in 637 ms.

It’s not easy to get this right without test data

    select
        e.id, folder, subject, in_reply_to, message_id, "references", e.updated_at,
        sum(the_count) as the_count
    from (
        select
            *,
            (
                "references"[1] = any ("references")
                or message_id = "references"[1]
            )::integer as the_count
        from emails
    ) e
    inner join email_participants on email_participants.email_id = e.id
    where user_id = 220 and folder = 'INBOX'
    group by 1, 2, 3, 4, 5, 6, 7
    order by e.updated_at desc
    limit 10 offset 0;

Your query is slow because it does a table or index scan for each row of the result set. This is called a correlated subquery.

group by 1, 2, ... is just shorthand that refers to the columns of the select list by position.

Casting a boolean to an integer yields 1 or 0.
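
Both points can be seen in a tiny standalone query. The VALUES rows and the column names folder and unread are made up for the illustration and have nothing to do with the real emails table.

```sql
-- GROUP BY 1 groups by the first select-list column (folder);
-- summing a boolean cast to integer counts the rows where it is true.
SELECT folder,
       sum(unread::integer) AS unread_count
FROM (VALUES ('INBOX', true),
             ('INBOX', false),
             ('Sent',  false)) AS t(folder, unread)
GROUP BY 1
ORDER BY 1;
-- INBOX | 1
-- Sent  | 0
```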


I used the query in your pastebin as a starting point. It differs from the one posted here in that it does not join the email_participants table.

I believe it could be that simple (or am I missing something?):

    SELECT e.id, e.folder, e.subject, e.message_id, e.references, e.updated_at,
           COUNT(e1.message_id)
    FROM emails e
    LEFT OUTER JOIN emails e1
      ON e1.message_id = e.message_id
     AND (e1.references[1] = ANY (e.references) OR e1.references[1] = e.message_id)
    GROUP BY e.id, e.folder, e.subject, e.message_id, e.references, e.updated_at;
