Window functions or common table expressions: count previous rows within a range

I would like to use a window function to determine, for each row, the total number of preceding records that meet certain criteria.

Specific example:

 clone=# \d test
         Table "pg_temp_2.test"
  Column |            Type             | Modifiers
 --------+-----------------------------+-----------
  id     | bigint                      |
  date   | timestamp without time zone |

I would like to know, for each date, the number of rows within "1 hour before" that date.
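For illustration, with three hypothetical rows (not from my actual data) the desired result would look like this:

```sql
-- Hypothetical sample to show the desired semantics:
-- id | date                | ct  (rows with date in [date - 1h, date))
-- ---+---------------------+----
--  1 | 2013-08-08 10:00:00 |  0  -- no earlier rows
--  2 | 2013-08-08 10:30:00 |  1  -- row 1 falls within the previous hour
--  3 | 2013-08-08 11:45:00 |  0  -- row 2 is more than an hour earlier
```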

Can I do this using a window function? Or do I need to look into CTEs?

I really want to write something like this (which doesn't work):

 SELECT id, date
      , count(*) OVER (HAVING previous_rows.date >= (date - '1 hour'::interval))
 FROM   test;

I can write this by joining test against itself, as below, but it will not scale to particularly large tables.

 SELECT a.id, a.date, count(b.*) - 1
 FROM   test a, test b
 WHERE  b.date >= a.date - '1 hour'::interval
 AND    b.date <  a.date
 GROUP  BY 1, 2
 ORDER  BY 2;

Can I do something with a recursive query? Or a regular CTE? CTEs are not something I know much about yet. I have a feeling I'll get there soon. :)

2 answers

I don't think you can do this cheaply with a plain query, CTEs, or window functions - their frame definition is static, but you need a dynamic frame.

Typically, you need to carefully define the lower and upper bounds of your window: the following queries exclude the current row and include the lower bound.
There is still a slight difference: the function includes previous peers of the current row, while the correlated subquery excludes them ...
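As an aside: this answer targets PostgreSQL 9.1. Since PostgreSQL 11, RANGE frames accept an offset and a frame exclusion clause, so the dynamic frame can be expressed directly. A sketch, not part of the benchmark below (EXCLUDE GROUP drops the current row and its peers, matching the correlated subquery's semantics):

```sql
SELECT id, ts
     , count(*) OVER (ORDER BY ts
                      RANGE BETWEEN interval '1 hour' PRECEDING AND CURRENT ROW
                      EXCLUDE GROUP) AS ct
FROM   test;
```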

Test case

Using ts instead of the reserved word date as the column name.

 CREATE TEMP TABLE test (
   id bigint
 , ts timestamp
 );

ROM - Roman's query

Use CTEs, aggregate timestamps into an array, unnest, count ...
Sure enough, performance deteriorates dramatically with more than a handful of rows. There are several performance killers here. See below.

ARR - count array elements

I took Roman's query and tried to simplify it a bit:
- Remove the second CTE, which is not needed.
- Convert the first CTE to a subquery, which is faster.
- Use a direct count() instead of re-aggregating into an array and counting with array_length() .

But array handling is expensive, and performance still deteriorates badly with many rows.

 SELECT id, ts
      , (SELECT count(*)::int - 1
         FROM   unnest(dates) x
         WHERE  x >= sub.ts - interval '1h') AS ct
 FROM  (
    SELECT id, ts
         , array_agg(ts) OVER (ORDER BY ts) AS dates
    FROM   test
    ) sub;

COR - correlated subquery

You can solve this with a simple, ugly correlated subquery. Much faster, but still ...

 SELECT id, ts
      , (SELECT count(*)
         FROM   test t1
         WHERE  t1.ts >= t.ts - interval '1h'
         AND    t1.ts <  t.ts) AS ct
 FROM   test t
 ORDER  BY ts;

FNC - Function

Loop over the rows in chronological order using row_number() in a plpgsql function, and combine that with a cursor over the same query, offset by the required time interval. Then we can simply subtract row numbers. Should perform nicely.

 CREATE OR REPLACE FUNCTION running_window_ct()
   RETURNS TABLE (id bigint, ts timestamp, ct int) AS
 $func$
 DECLARE
    i CONSTANT interval = '1h';  -- given interval for time frame
    cur CURSOR FOR
        SELECT t.ts + i AS ts1   -- incremented by given interval
             , row_number() OVER (ORDER BY t.ts) AS rn
        FROM   test t
        ORDER  BY t.ts;          -- in chronological order
    rec record;                  -- for current row from cursor
    rn  int;
 BEGIN
    OPEN cur;
    FETCH cur INTO rec;          -- open cursor, fetch first row
    ct := -1;                    -- init; -1 covers special case at start

    FOR id, ts, rn IN
       SELECT t.id, t.ts, row_number() OVER (ORDER BY t.ts)
       FROM   test t
       ORDER  BY t.ts            -- in same chronological order as cursor
    LOOP
       IF rec.ts1 >= ts THEN     -- still in range ...
          ct := ct + 1;          -- ... just increment
       ELSE                      -- out of range ...
          LOOP                   -- ... advance cursor
             FETCH cur INTO rec;
             EXIT WHEN rec.ts1 >= ts;  -- earliest row within time frame
          END LOOP;
          ct := rn - rec.rn;     -- new count
       END IF;
       RETURN NEXT;
    END LOOP;
 END
 $func$ LANGUAGE plpgsql;

Call:

 SELECT * FROM running_window_ct(); 

SQL Fiddle

Benchmark

To compare the above, I ran a quick test on my old test server (PostgreSQL 9.1.9 on Debian).

 -- TRUNCATE test;
 INSERT INTO test
 SELECT g, '2013-08-08'::timestamp
         + g * interval '5 min'
         + random() * 300 * interval '1 min'  -- halfway realistic values
 FROM   generate_series(1, 10000) g;

 CREATE INDEX test_ts_idx ON test (ts);
 ANALYZE test;  -- temp table needs manual analyze

I changed the bold part for each run and took the best of five with EXPLAIN ANALYZE .

100 rows
ROM: 27.656 ms
ARR: 7.834 ms
COR: 5.488 ms
FNC: 1.115 ms

1000 rows
ROM: 2116.029 ms
ARR: 189.679 ms
COR: 65.802 ms
FNC: 8.466 ms

5000 rows
ROM: 51347 ms !!
ARR: 3167 ms
COR: 333 ms
FNC: 42 ms

100,000 rows
ROM: DNF
ARR: DNF
COR: 6760 ms
FNC: 828 ms

The function is the clear winner. It is the fastest by an order of magnitude. Array handling cannot compete.


update My previous attempt doesn't scale well because it keeps accumulating all elements into the array, which is not what I wanted. So here is an updated version - it still doesn't perform as well as the self-join or the function with cursors, but it's not as terrible as my previous one:

 CREATE OR REPLACE FUNCTION agg_array_range_func (
    accum    anyarray
  , el_cur   anyelement
  , el_start anyelement
  , el_end   anyelement
 ) RETURNS anyarray AS
 $func$
 declare
    i int;
    N int;
 begin
    N := array_length(accum, 1);
    i := 1;

    if N = 0 then
       return array[el_cur];
    end if;

    while i <= N loop
       if accum[i] between el_start and el_end then
          exit;
       end if;
       i := i + 1;
    end loop;

    return accum[i:N] || el_cur;
 end;
 $func$ LANGUAGE plpgsql;

 CREATE AGGREGATE agg_array_range (anyelement, anyelement, anyelement) (
    SFUNC = agg_array_range_func
  , STYPE = anyarray
 );

 select id, ts
      , array_length(
           agg_array_range(ts, ts - interval '1 hour', ts) over (order by ts)
         , 1) - 1 as cnt
 from test;

I tested on my local machine and on SQL Fiddle, and the self-join actually performed best (I was surprised; my results do not match Erwin's), then Erwin's function, and then this aggregate. You can check it yourself in SQL Fiddle.

previous I'm still learning PostgreSQL, but I really like all its features. If this were SQL Server, I would use select for xml and select from xml. I don't know how to do that in PostgreSQL, but there is something much better for this task - arrays!!!
So here is my CTE with window functions (I think it will not work correctly if there are duplicate dates in the table, and I also don't know whether it will perform better than a self-join):

 with cte1 as (
    select id, ts
         , array_agg(ts) over (order by ts asc) as dates
    from test
 ), cte2 as (
    select c.id, c.ts
         , array(
              select arr
              from  (select unnest(dates) as arr) as x
              where x.arr >= c.ts - '1 hour'::interval
           ) as dates
    from cte1 as c
 )
 select c.id, c.ts, array_length(c.dates, 1) - 1 as cnt
 from cte2 as c;

see the SQL Fiddle demo

Hope that helps.

