Improve query efficiency for combining CTE + regular choices

Based on Convert the recursive function to view , I would like to speed up the extraction of the path from any arbitrary node on the graph to its root by creating a snapshot of the node's parents. The idea is that the recursive tree-like movement is limited by the presence of intermediate snapshots that exclude further recursion and therefore speed up execution time. I did not perform the load test, so I donโ€™t know how it works outside of this simple example, but early tests already point to some bottlenecks. I would be happy for comments on how to speed up the process / simplify queries. I am using Postgres 9.2.2.0 (20).

DROP TABLE IF EXISTS revision CASCADE; CREATE TABLE revision ( id serial primary key, parent_revision_id int references revision(id), username varchar(128), ts timestamp without time zone ); DROP TABLE IF EXISTS revision_snapshot CASCADE; CREATE TABLE revision_snapshot ( id serial primary key, revision_id int, parent_revision_id int, depth int ); CREATE OR REPLACE FUNCTION create_revision_snapshot(_rev int) RETURNS void AS $func$ DELETE FROM revision_snapshot WHERE revision_id=$1; INSERT INTO revision_snapshot (revision_id, parent_revision_id, depth) (SELECT $1, id, depth FROM revision_tree($1)); $func$ LANGUAGE sql; -- Recursively return path from '_rev' to root CREATE OR REPLACE FUNCTION revision_tree(_rev int) RETURNS TABLE(id int, parent_revision_id int, depth int) AS $func$ WITH RECURSIVE rev_list(id, parent_revision_id, depth) AS ( SELECT t.id, t.parent_revision_id, 1 FROM revision t WHERE t.id = $1 UNION ALL SELECT t.id, t.parent_revision_id, r.depth + 1 FROM rev_list r JOIN revision t ON t.id = r.parent_revision_id ) SELECT t.id, t.parent_revision_id, t.depth FROM rev_list t ORDER BY t.id; $func$ LANGUAGE sql; -- Fast version of 'revision_tree' (to be). This version will return the -- revision tree making use of snapshots (recursively returning the path from -- specified revision id to last snapshot of the path to the root + the snapshot) CREATE OR REPLACE FUNCTION revision_tree_perf(_rev int) RETURNS TABLE(parent_revision_id int) AS $func$ BEGIN CREATE TEMP TABLE graph_result ON COMMIT DROP AS WITH RECURSIVE rev_list(id, parent_revision_id, depth) AS ( SELECT t.id, t.parent_revision_id, 1 FROM revision t WHERE t.id = $1 UNION ALL SELECT t.id, t.parent_revision_id, r.depth + 1 FROM rev_list r JOIN revision t ON t.id = r.parent_revision_id WHERE not(t.id in (select revision_id from revision_snapshot)) ) SELECT t.id, t.parent_revision_id, t.depth FROM rev_list t ORDER BY t.id; RETURN QUERY SELECT g.parent_revision_id FROM graph_result AS g WHERE g.parent_revision_id IS NOT NULL UNION SELECT s.parent_revision_id FROM revision_snapshot AS s WHERE s.revision_id = (SELECT min(q.parent_revision_id) FROM graph_result as q) ORDER BY parent_revision_id; END; $func$ LANGUAGE 'plpgsql'; -- Example tree -- -- +-- <10> -- / -- +-- <4> -- <8> --- <9> -+- <11> --- <15> --- <16> --- <17> -- / -- <1> --- <2> --- <3> -+ -- \ -- +-- <5> --- <6> --- <7> --- <12> -+- <14> --- <18> -- \ -- \ -- \ -- \ -- +-- <13> --- <19> --- <20> --- <21> -- INSERT INTO revision (username, ts, parent_revision_id) VALUES ('someone', now(), null) -- 1 ,('someone', now(), 1) -- 2 ,('someone', now(), 2) -- 3 ,('someone', now(), 3) -- 4 ,('someone', now(), 3) -- 5 ,('someone', now(), 5) -- 6 ,('someone', now(), 6) -- 7 ,('someone', now(), 4) -- 8 ,('someone', now(), 8) -- 9 ,('someone', now(), 9) -- 10 ,('someone', now(), 9) -- 11 ,('someone', now(), 7) -- 12 ,('someone', now(), 12) -- 13 ,('someone', now(), 12) -- 14 ,('someone', now(), 11) -- 15 ,('someone', now(), 15) -- 16 ,('someone', now(), 16) -- 17 ,('someone', now(), 14) -- 18 ,('someone', now(), 13) -- 19 ,('someone', now(), 19) -- 20 ,('someone', now(), 20); -- 21 -- Create a revision snapsnot select create_revision_snapshot(13); -- This query is meant to be faster ... select * from revision_tree_perf(21); -- ... than this one select * from revision_tree(21); 

Above example

 select create_revision_snapshot(13); select * from revision_tree_perf(21); 

It is intended to create a record set that denotes a path from 21 to the root, i.e. (21, 20, 19, 13, 12, 7, 6, 5, 3, 2, 1) . Part of the solution was taken by walking through the tree (from 21 to 13, since there is a 13-shot, so there is no need to go further through the tree) and using the already โ€œcachedโ€ path from 13 to the root (taken from revision_snapshot). Hope this is a little easier to understand ...

Update:
I came up with a potential improvement. Itโ€™s just a blow in the dark, but I can imagine that the offers exists quite expensive. Now I have noted the presence of a snapshot in the revision table:

 CREATE TABLE revision ( id serial primary key, parent_revision_id int references revision(id), username varchar(128), has_snapshot boolean default false, ts timestamp without time zone ); CREATE OR REPLACE FUNCTION create_revision_snapshot(_rev int) RETURNS void AS $func$ DELETE FROM revision_snapshot WHERE revision_id=$1; INSERT INTO revision_snapshot (revision_id, parent_revision_id, depth) (SELECT $1, id, depth FROM revision_tree($1)); -- Mark revision table to avoid costly exists/in clause UPDATE revision SET has_snapshot = true WHERE id=$1; $func$ LANGUAGE sql; 

This changes the CTE revision_tree_perf SP part to

 WITH RECURSIVE rev_list(id, parent_revision_id, depth) AS ( SELECT t.id, t.parent_revision_id, 1 -- AS depth FROM revision t WHERE t.id = $1 UNION ALL SELECT t.id, t.parent_revision_id, r.depth + 1 FROM rev_list r JOIN revision t ON t.id = r.parent_revision_id WHERE t.has_snapshot = false ) SELECT t.id, t.parent_revision_id, t.depth FROM rev_list t ORDER BY t.id; 

This should be pretty fast. Another part of the puzzle will be to return the contents of the snapshot revision from the revision identifier, where has_snapshot=true , and combine the two results. The problem is how to get this revision ID from the CTE. I could save the result of the CTE query in the temp table and request the revision identifier, or perhaps it would be advisable not to write the CTE and write instead. Thus, it would be possible to track the revision identifier, according to which the loop will exit (when has_snapshot = true ). But I'm not sure how this will compare to CTE.

What do people think of this approach?

+7
source share
2 answers

Here is a fully redesigned version .
In my minimal test, the new function using revision_snapshot is actually faster. I have made numerous changes. The most important thing:

  • Do not add a column to the source table. This may speed up the query slightly, but it also introduces the cost and overhead in the main table. It can pay if everything you do with the table fulfills this function, but in real life this is only one of many tasks.

  • Delete the temporary table from the function. Can be made much cheaper only with CTE.

  • Fix ORDER BY what was wrong.

  • Read the comments in the code for more details.

Also like -> sqlfiddle to play with.

 CREATE TABLE revision ( revision_id serial PRIMARY KEY -- Don't use useless name "id", that an anti-pattern of ORMs ,parent_revision_id int NOT NULL REFERENCES revision(revision_id) DEFERRABLE -- must be DEFERRABLE for self-reference of root ,ts timestamp NOT NULL -- all columns NOT NULL ,has_snapshot boolean NOT NULL DEFAULT FALSE -- columns ordered for perfect packing and performance ,username text NOT NULL ); CREATE TABLE revision_snapshot ( depth int PRIMARY KEY ,revision_id int ); -- simplified -- Recursively return path from '_revision_id' to root CREATE OR REPLACE FUNCTION revision_tree(_revision_id int) RETURNS TABLE(depth int, revision_id int) AS $func$ WITH RECURSIVE l AS ( SELECT 1::int AS depth, r.parent_revision_id AS revision_id FROM revision r WHERE r.revision_id = $1 UNION ALL SELECT l.depth + 1, r.parent_revision_id -- AS revision_id FROM l JOIN revision r USING (revision_id) WHERE r.parent_revision_id <> 0 ) SELECT * FROM l ORDER BY l.depth; -- NOT revision_id! $func$ LANGUAGE sql; CREATE OR REPLACE FUNCTION create_revision_snapshot(_revision_id int) RETURNS void AS $func$ -- for tiny tables, DELETE is faster than TRUNCATE DELETE FROM revision_snapshot; INSERT INTO revision_snapshot (depth, revision_id) SELECT depth, revision_id FROM revision_tree($1); $func$ LANGUAGE sql; -- Faster version of 'revision_tree'. -- Stops recursion as soon as revision_snapshot covers the "last mile" to root CREATE OR REPLACE FUNCTION revision_tree_perf(_revision_id int) RETURNS TABLE(revision_id int) AS $func$ BEGIN RETURN QUERY -- works without expensive temp table WITH RECURSIVE l AS ( SELECT 1::int AS depth, r.parent_revision_id AS revision_id -- trim cruft, only two columns needed FROM revision r WHERE r.revision_id = $1 UNION ALL SELECT l.depth + 1, r.parent_revision_id -- AS revision_id FROM l JOIN revision r USING (revision_id) WHERE r.parent_revision_id <> 0 -- stop condition needed, since parent_revision_id IS NOT NULL AND NOT EXISTS ( -- NOT EXISTS faster than IN (SELECT...) SELECT 1 FROM revision_snapshot s WHERE s.revision_id = l.revision_id) ) ( -- extra parens needed for separate ORDER BY in UNION ALL SELECT l.revision_id FROM l ORDER BY l.depth -- NOT revision_id! Bug just didn't show because the test ids were ordered. ) UNION ALL -- NOT: UNION - correct and faster ( SELECT s.revision_id FROM revision_snapshot s WHERE s.depth > ( SELECT s0.depth FROM revision_snapshot s0 JOIN l USING (revision_id) ) -- must be exactly 1 value - follows logically from CTE ORDER BY s.depth ); END -- no ; needed here $func$ LANGUAGE plpgsql; -- DO NOT quote language name! -- Example tree -- -- +-- <10> -- / -- +- <4> -- <8> -- <9> -+- <11> -- <15> -- <16> -- <17> -- / -- <0> -- <1> -- <2> -- <3> -+ -- \ -- +- <5> -- <6> -- <7> -- <12> -+- <14> -- <18> -- \ -- \ -- \ -- \ -- +- <13> -- <19> -- <20> -- <21> -- INSERT INTO revision (revision_id, username, ts, parent_revision_id) VALUES (0, 'root', now(), 0) -- referencing itself ,(1, 'someone', now(), 0) ,(2, 'someone', now(), 1) ,(3, 'someone', now(), 2) ,(4, 'someone', now(), 3) ,(5, 'someone', now(), 3) ,(6, 'someone', now(), 5) ,(7, 'someone', now(), 6) ,(8, 'someone', now(), 4) ,(9, 'someone', now(), 8) ,(10,'someone', now(), 9) ,(11,'someone', now(), 9) ,(12,'someone', now(), 7) ,(13,'someone', now(), 12) ,(14,'someone', now(), 12) ,(15,'someone', now(), 11) ,(16,'someone', now(), 15) ,(17,'someone', now(), 16) ,(18,'someone', now(), 14) ,(19,'someone', now(), 13) ,(20,'someone', now(), 19) ,(21,'someone', now(), 20); ANALYZE revision; -- Create a revision snapsnot select create_revision_snapshot(13); ANALYZE revision_snapshot; 

Call:

Now this query should be faster:

 SELECT * FROM revision_tree_perf(21); 

.. than this one:

 SELECT * FROM revision_tree(21); 
0
source

[This is not an answer; Unfortunately, my naive view is much slower than the materialized version in OQ; -]

Here is a snippet to slightly increase the revision table population (after 21 VALUES have been inserted, obviously):

  -- clone the revision-table to make it larger INSERT INTO revision (username, ts, parent_revision_id) SELECT 'user' || src.id || 'clone' ||gs , src.ts+ gs*'1 min'::interval , src.parent_revision_id+ (21*gs) FROM revision src JOIN generate_series(1,10000) gs ON 1=1 ; -- SELECT * FROM revision; VACUUM ANALYZE revision; 

UPDATE:

I managed to create my own versions of functions that look somewhat faster than the originals (even if they are called before the originals)

 DROP FUNCTION fnc_naive(_me INTEGER); CREATE OR REPLACE FUNCTION fnc_naive(_me INTEGER) RETURNS TABLE(fixed int, this int, ancestor int, depth int) AS $body$ WITH RECURSIVE tree(fixed, this, some_ancestor, the_level) AS ( SELECT t.id AS fixed, t.id AS this , t.parent_revision_id AS some_ancestor, 1 AS the_level FROM revision t WHERE t.id = $1 UNION ALL SELECT tr.fixed AS fixed, rev.id AS this , rev.parent_revision_id AS some_ancestor , tr.the_level+1 AS the_level FROM tree tr JOIN revision rev ON rev.id = tr.some_ancestor ) SELECT t.fixed, t.this, t.some_ancestor, t.the_level FROM tree t ORDER BY t.this ; $body$ LANGUAGE sql; -- Altered "Fast" version of 'revision_tree' (to be). This version will return the -- revision tree making use of snapshots (recursively returning the path from -- specified revision id to last snapshot of the path to the root + the snapshot) -- Changes: -- replaced the IN(subselect) by a corresponding EXISTS (subselect) -- Replaced the = (select(min() subselect) by the corresponding NOT EXISTS CREATE OR REPLACE FUNCTION revision_tree_perf_wp(_rev int) RETURNS TABLE(parent_revision_id int) AS $$ BEGIN CREATE TEMP TABLE tmp_graph_result_wp ON COMMIT DROP AS WITH RECURSIVE rev_list(id, parent_revision_id, depth) AS ( SELECT t.id, t.parent_revision_id, 1 FROM revision t WHERE t.id = $1 UNION ALL SELECT t.id, t.parent_revision_id, r.depth + 1 FROM rev_list r JOIN revision t ON t.id = r.parent_revision_id WHERE NOT EXISTS (SELECT * FROM revision_snapshot nx WHERE t.id = nx.revision_id ) ) SELECT t.id, t.parent_revision_id, t.depth FROM rev_list t ; RETURN QUERY SELECT g.parent_revision_id FROM tmp_graph_result_wp AS g WHERE g.parent_revision_id IS NOT NULL UNION SELECT s.parent_revision_id FROM revision_snapshot AS s WHERE NOT EXISTS ( SELECT * FROM tmp_graph_result_wp nx WHERE nx.parent_revision_id < s.revision_id ) ORDER BY parent_revision_id ; END; $$ LANGUAGE 'plpgsql'; 

Additional notes: I do not understand the (intended) use of "revision_snapshot" and temp tables and tables "graph_result". It seems to me that a naive trio will always be executed, plus the additional โ€œexists in temp and / or cache tablesโ€. In any case: the naive version is much faster than the "cached" one.

0
source

All Articles