PostgreSQL window function: partition by comparison

Question

I am trying to find a way to compare against the current row in the PARTITION BY clause of a window function in a PostgreSQL query.

Imagine I have the short list of 5 elements in the following query (in the real case, I have thousands or even millions of rows). For each row, I am trying to get the id of the next element with a different event (event column) and the id of the previous element with a different event.

    WITH events AS (
        SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
        UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
        UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
        UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
        UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
    )
    SELECT lag(id) over w as previous_different
         , event
         , lead(id) over w as next_different
    FROM events ev
    WINDOW w AS (PARTITION BY event!=ev.event ORDER BY date ASC);

I know that the comparison event!=ev.event is incorrect, but it shows what I want to achieve.

The result I get (the same as if I removed the PARTITION BY clause):

    previous_different | event | next_different
                       | 12    | 2
    1                  | 12    | 3
    2                  | 13    | 4
    3                  | 13    | 5
    4                  | 12    |

And this is the result I would like to get:

    previous_different | event | next_different
                       | 12    | 3
                       | 12    | 3
    2                  | 13    | 5
    2                  | 13    | 5
    4                  | 12    |

Does anyone know if this is possible and how? Thank you very much!

EDIT: I know I can do this with two JOINs, an ORDER BY and a DISTINCT ON, but in the real case with millions of rows this is very inefficient:

    WITH events AS (
        SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
        UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
        UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
        UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
        UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
    )
    SELECT DISTINCT ON (e.id, e.date) e1.id, e.event, e2.id
    FROM events e
    LEFT JOIN events e1 ON (e1.date<=e.date AND e1.id!=e.id AND e1.event!=e.event)
    LEFT JOIN events e2 ON (e2.date>=e.date AND e2.id!=e.id AND e2.event!=e.event)
    ORDER BY e.date ASC, e.id ASC, e1.date DESC, e1.id DESC, e2.date ASC, e2.id ASC

sql postgresql postgresql-performance window-functions

Aleix Mar 19 '14 at 16:32

1 answer

Erwin Brandstetter · Accepted Answer · 2014-03-20T01:08:32+0000

Using several different window functions and two subqueries, this should work fast enough:

    WITH events(id, event, ts) AS (
        VALUES
          (1, 12, '2014-03-19 08:00:00'::timestamp)
         ,(2, 12, '2014-03-19 08:30:00')
         ,(3, 13, '2014-03-19 09:00:00')
         ,(4, 13, '2014-03-19 09:30:00')
         ,(5, 12, '2014-03-19 10:00:00')
    )
    SELECT first_value(pre_id) OVER (PARTITION BY grp ORDER BY ts) AS pre_id
         , id, ts
         , first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
    FROM (
        SELECT *, count(step) OVER w AS grp
        FROM (
            SELECT id, ts
                 , NULLIF(lag(event) OVER w, event) AS step
                 , lag(id) OVER w AS pre_id
                 , lead(id) OVER w AS post_id
            FROM events
            WINDOW w AS (ORDER BY ts)
        ) sub1
        WINDOW w AS (ORDER BY ts)
    ) sub2
    ORDER BY ts;

I am using ts as the name of the timestamp column.
This assumes ts is unique and indexed (a unique constraint creates the index automatically).
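
For reference, a table that satisfies these assumptions could look like the sketch below. The table name events comes from the question; the column types and constraints are assumptions made for illustration, not the asker's actual schema.

```sql
-- Hypothetical table definition; types and constraints are assumptions.
CREATE TABLE events (
    id    integer   PRIMARY KEY,
    event integer   NOT NULL,
    ts    timestamp NOT NULL UNIQUE   -- UNIQUE creates a btree index on ts
);

-- The five sample rows from the question.
INSERT INTO events (id, event, ts) VALUES
    (1, 12, '2014-03-19 08:00:00'),
    (2, 12, '2014-03-19 08:30:00'),
    (3, 13, '2014-03-19 09:00:00'),
    (4, 13, '2014-03-19 09:30:00'),
    (5, 12, '2014-03-19 10:00:00');
```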

In a test with a real-life table of 50 thousand rows, it needed only a single index scan. So it should be decently fast even with big tables. For comparison, your query with join / distinct did not finish within a minute (as expected).
Even an optimized version, dealing with one cross join at a time (a left join with hardly any limiting condition is effectively a limited cross join), did not finish within a minute.

For best performance with a large table, tune your memory settings, in particular work_mem (for big sort operations). Consider setting it (much) higher for your session temporarily, if you can spare the RAM. More details here and here.
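
As a rough sketch of how that could be tried out: the block below raises work_mem for a single transaction and inspects the plan. It assumes a table events(id, event, ts) already exists, and '256MB' is only an illustrative value, not a recommendation.

```sql
-- Assumes a table events(id, event, ts) exists; '256MB' is illustrative only.
BEGIN;
SET LOCAL work_mem = '256MB';   -- reverts automatically at COMMIT / ROLLBACK

-- Show the actual plan, timings and buffer usage for the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT first_value(pre_id)  OVER (PARTITION BY grp ORDER BY ts)      AS pre_id
     , id, ts
     , first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM (
    SELECT *, count(step) OVER w AS grp
    FROM (
        SELECT id, ts
             , NULLIF(lag(event) OVER w, event) AS step
             , lag(id)  OVER w AS pre_id
             , lead(id) OVER w AS post_id
        FROM events
        WINDOW w AS (ORDER BY ts)
    ) sub1
    WINDOW w AS (ORDER BY ts)
) sub2
ORDER BY ts;

COMMIT;
```

In the plan output, a line such as "Sort Method: external merge  Disk: ..." means a sort spilled to disk, which is a hint that work_mem is too small for that sort.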

How?

In the subquery sub1, look at the event from the previous row and keep it only if it has changed, thereby marking the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).
In the subquery sub2, count() only counts non-null values. The resulting grp marks peers in blocks of consecutive identical events.
In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.
Actually, this should be even faster in the outer SELECT:
```
  last_value(post_id) OVER (PARTITION BY grp ORDER BY ts RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS post_id 
```
... since the sort order of this window agrees with the window for pre_id, so only a single sort is needed. A quick test seems to confirm it. More about this frame definition.
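
To make the three steps easier to follow, here is a sketch that runs the same logic on the sample data and also exposes the intermediate columns. It uses the last_value() variant in the outer SELECT. The aliases lag_id and lead_id are introduced here only for illustration and are not part of the original query.

```sql
WITH events(id, event, ts) AS (
    VALUES
      (1, 12, '2014-03-19 08:00:00'::timestamp)
     ,(2, 12, '2014-03-19 08:30:00')
     ,(3, 13, '2014-03-19 09:00:00')
     ,(4, 13, '2014-03-19 09:30:00')
     ,(5, 12, '2014-03-19 10:00:00')
)
SELECT id
     , step                -- previous event, only where the event changed
     , grp                 -- running count of changes = block number
     , pre_id  AS lag_id   -- id of the directly preceding row (illustrative alias)
     , post_id AS lead_id  -- id of the directly following row (illustrative alias)
     , first_value(pre_id) OVER (PARTITION BY grp ORDER BY ts) AS pre_id
     , last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
                                 RANGE BETWEEN UNBOUNDED PRECEDING
                                           AND UNBOUNDED FOLLOWING) AS post_id
FROM (
    SELECT *, count(step) OVER w AS grp   -- count() skips NULL steps
    FROM (
        SELECT id, ts
             , NULLIF(lag(event) OVER w, event) AS step
             , lag(id)  OVER w AS pre_id
             , lead(id) OVER w AS post_id
        FROM events
        WINDOW w AS (ORDER BY ts)
    ) sub1
    WINDOW w AS (ORDER BY ts)
) sub2
ORDER BY ts;

-- Expected rows for the sample data:
--  id | step | grp | lag_id | lead_id | pre_id | post_id
--   1 |      |   0 |        |       2 |        |       3
--   2 |      |   0 |      1 |       3 |        |       3
--   3 |   12 |   1 |      2 |       4 |      2 |       5
--   4 |      |   1 |      3 |       5 |      2 |       5
--   5 |   13 |   2 |      4 |         |      4 |
```

The pre_id and post_id columns match the result requested in the question; step and grp show how the blocks of consecutive identical events are formed.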

SQL Fiddle
