Postgres time series queries

Question

This is a follow-up question to @Erwin's answer to Postgres Efficient Time Series Query.

To keep things simple, I will use the same table structure as that question:

id | widget_id | for_date | score |
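For anyone who wants to run the queries in this thread, here is a minimal sketch of a matching table definition. The column types and constraints are assumptions; the question only gives the column names.

```sql
-- Assumed types: integer ids and scores, a plain date for for_date.
CREATE TABLE score (
  id        integer PRIMARY KEY,
  widget_id integer NOT NULL,
  for_date  date    NOT NULL,
  score     integer
);
```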

The original question was to get the score for each of the widgets for every date in a range. If there was no entry for a widget on a date, then show the score from the previous entry for that widget. The solution using a cross join and a window function worked well if all the data was contained in the range being queried. My problem is that I want the previous score even if it lies outside the date range we are looking at.

Sample data:

    INSERT INTO score (id, widget_id, for_date, score) VALUES
      (1, 1337, '2012-04-07', 52),
      (2, 2222, '2012-05-05', 99),
      (3, 1337, '2012-05-07', 112),
      (4, 2222, '2012-05-07', 101);

When I query the range from May 5 to May 10, 2012 (i.e. generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')), I would like to get the following:

    DAY           WIDGET_ID  SCORE
    May, 05 2012  1337       52
    May, 05 2012  2222       99
    May, 06 2012  1337       52
    May, 06 2012  2222       99
    May, 07 2012  1337       112
    May, 07 2012  2222       101
    May, 08 2012  1337       112
    May, 08 2012  2222       101
    May, 09 2012  1337       112
    May, 09 2012  2222       101
    May, 10 2012  1337       112
    May, 10 2012  2222       101

The best solution so far (also by @Erwin):

    SELECT a.day, a.widget_id, s.score
    FROM  (
       SELECT d.day, w.widget_id
             ,max(s.for_date) OVER (PARTITION BY w.widget_id ORDER BY d.day) AS effective_date
       FROM  (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date AS day) d
       CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
       LEFT   JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
       ) a
    LEFT   JOIN score s ON s.for_date = a.effective_date AND s.widget_id = a.widget_id
    ORDER  BY a.day, a.widget_id;

But as you can see in this SQL Fiddle, it produces NULL scores for widget 1337 on the first two days. I would like to see the earlier score of 52 from row 1 in its place.

Can this be done in an efficient way?

+3

sql greatest-n-per-group postgresql time-series generate-series

bpaul Oct 18 '13 at 5:38


3 answers

As you wrote, you need to find the matching score, and if there is a gap, fill it with the nearest earlier score. In SQL that would be:

    SELECT d.day, w.widget_id,
           coalesce(s.score,
                    (select s2.score
                     from score s2
                     where s2.for_date < d.day
                       and s2.widget_id = w.widget_id
                     order by s2.for_date desc
                     limit 1)) as score
    from (select distinct widget_id FROM score) AS w
    cross join (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date AS day) d
    left join score s ON (s.for_date = d.day AND s.widget_id = w.widget_id)
    order by d.day, w.widget_id;

Coalesce in this case means "if there is a gap": when the join finds no score for that day, the subquery result is used instead.
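As a small illustration of that behaviour, coalesce returns its first argument that is not NULL. The literal values below are taken from the sample data only to make the two cases concrete; the column aliases are made up for this sketch.

```sql
SELECT coalesce(NULL, 52) AS gap_filled,   -- no score on that day: the fallback value is used
       coalesce(112, 52)  AS exact_match;  -- a score exists on that day: it wins over the fallback
```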

+1

Tomasz Myrta Oct 18 '13 at 6:25

You can use the distinct on syntax in PostgreSQL:

    with cte_d as (
        select generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date as day
    ), cte_w as (
        select distinct widget_id from score
    )
    select distinct on (d.day, w.widget_id)
        d.day, w.widget_id, s.score
    from cte_d as d
        cross join cte_w as w
        left outer join score as s on
            s.widget_id = w.widget_id and s.for_date <= d.day
    order by d.day, w.widget_id, s.for_date desc;

or get the maximum date with a subquery:

    with cte_d as (
        select generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date as day
    ), cte_w as (
        select distinct widget_id from score
    )
    select d.day, w.widget_id, s.score
    from cte_d as d
        cross join cte_w as w
        left outer join score as s on s.widget_id = w.widget_id
    where
        exists (
            select 1
            from score as tt
            where tt.widget_id = w.widget_id and tt.for_date <= d.day
            having max(tt.for_date) = s.for_date
        )
    order by d.day, w.widget_id;

Performance really depends on the indexes you have on your table (possibly a unique one on widget_id, for_date). I think that if you have many rows for each widget_id, then the second query will be more efficient, but you should test it on your data.
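One way to run that comparison is to put EXPLAIN (ANALYZE, BUFFERS) in front of each variant and compare the reported plans and timings. The sketch below wraps the distinct on version; the same prefix works for the subquery version. Note that ANALYZE actually executes the query.

```sql
-- Executes the query and reports the actual plan, row counts and timings.
EXPLAIN (ANALYZE, BUFFERS)
with cte_d as (
    select generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date as day
), cte_w as (
    select distinct widget_id from score
)
select distinct on (d.day, w.widget_id)
    d.day, w.widget_id, s.score
from cte_d as d
    cross join cte_w as w
    left outer join score as s on
        s.widget_id = w.widget_id and s.for_date <= d.day
order by d.day, w.widget_id, s.for_date desc;
```

Timings on a four-row sample table say little, so this is only meaningful against a realistic data volume.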

sql fiddle demo

+1

Roman Pekar Oct 18 '13 at 6:50

Erwin Brandstetter · Accepted Answer · 2013-10-18T15:00:35+0000

As @Roman noted, DISTINCT ON can solve this problem. Details in this related answer:

Select first row in each GROUP BY group?

Subqueries are generally a bit faster than CTEs, though:

    SELECT DISTINCT ON (d.day, w.widget_id)
           d.day, w.widget_id, s.score
    FROM   generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') d(day)
    CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
    LEFT   JOIN score s ON s.widget_id = w.widget_id AND s.for_date <= d.day
    ORDER  BY d.day, w.widget_id, s.for_date DESC;

You can use a set-returning function like a table in the FROM list.
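A minimal sketch of that pattern on its own: generate_series() sits in the FROM list, and the alias d(day) names both the derived table and its single column. With an interval step the function returns timestamps rather than dates, so the cast to date here is an optional addition for display.

```sql
-- One row per day from 2012-05-05 through 2012-05-10, inclusive.
SELECT d.day::date AS day
FROM   generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') AS d(day);
```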

SQL Fiddle

One multi-column index should be the key to performance:

    CREATE INDEX score_multi_idx ON score (widget_id, for_date, score);

The third column, score, is only included to make it a covering index in Postgres 9.2 or later, where index-only scans are available. You would not include it in earlier versions.
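On Postgres 11 or later, which postdates this answer, a similar covering effect can be sketched with an INCLUDE clause: score is stored in the index as a payload column without being part of the search key. The index name below is made up for illustration.

```sql
-- Requires Postgres 11+. Key columns: (widget_id, for_date); score is carried along only.
CREATE INDEX score_widget_date_incl_idx
    ON score (widget_id, for_date) INCLUDE (score);
```

Whether this beats the plain three-column index depends on the data and is worth measuring rather than assuming.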

Of course, if you have many widgets and a wide range of days, the CROSS JOIN produces a lot of rows, which comes at a cost. Select only the widgets and days that you actually need.
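A sketch of what that restriction could look like, assuming only widget 1337 is of interest: the SELECT DISTINCT over the whole table is swapped for an explicit VALUES list, so the cross join only produces rows for the listed widgets. The choice of widget and the date cast in the output are assumptions for this example.

```sql
SELECT DISTINCT ON (d.day, w.widget_id)
       d.day::date AS day, w.widget_id, s.score
FROM   generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') d(day)
CROSS  JOIN (VALUES (1337)) AS w(widget_id)   -- list only the widgets you need
LEFT   JOIN score s ON s.widget_id = w.widget_id AND s.for_date <= d.day
ORDER  BY d.day, w.widget_id, s.for_date DESC;
```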
