Efficiently access a huge time series table for one row every 15 minutes

I have two tables: conttagtable (t) and contfloattable (cf). t has about 43 thousand rows; cf has over 9 billion.

I created an index on the tagindex column in both tables. This column can be thought of as a unique identifier in conttagtable and as a foreign key into conttagtable from contfloattable. I did not explicitly create a PK or a foreign key relating either table to the other, although the data is logically related by the tagindex column in both tables, as if conttagtable.tagindex were a PRIMARY KEY and contfloattable.tagindex were a FOREIGN KEY (tagindex) REFERENCES conttagtable(tagindex). The data came from a Microsoft Access dump and I did not know whether I could trust tagindex to be unique, so uniqueness is not enforced.
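For reference, this is roughly what that relationship would look like if it were declared (a sketch only; these constraints do not exist in the actual schema, and the constraint name is illustrative):

    -- Sketch only: not present in the real schema, shown to illustrate the logical relationship.
    ALTER TABLE conttagtable ADD PRIMARY KEY (tagindex);
    ALTER TABLE contfloattable
        ADD CONSTRAINT contfloattable_tagindex_fkey   -- illustrative name
        FOREIGN KEY (tagindex) REFERENCES conttagtable (tagindex);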

The data itself is extremely large.

I need to get one arbitrarily selected row from contfloattable for each 15-minute interval of contfloattable.dateandtime, for each conttagtable.tagid. So if the contfloattable rows for a given tagid contain 4000 samples spanning 30 minutes, I need one sample from the 0-14 minute range and one sample from the 15-30 minute range. Any sample within a 15-minute interval is valid: first, last, random, whatever.

In short, I need one sample every 15 minutes, but only one sample per t.tagname. Samples are currently being recorded every 5 seconds and the data spans two years, so this is a big-data problem from my SQL point of view. All the time-interval solutions I have tried from googling or searching SO have produced query times so long that they are not practical.
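For reference, the 15-minute bucketing idea behind my attempts below can be sketched like this (the grouping step only; any aggregate that picks one sample per bucket would do):

    -- Sketch: map each timestamp onto its 15-minute bucket and keep one sample
    -- per (tagindex, bucket); 60 * 15 = 900 seconds per bucket.
    SELECT tagindex,
           floor(extract(epoch FROM dateandtime) / 60 / 15) AS bucket,
           min(dateandtime) AS sample_time
    FROM contfloattable
    GROUP BY tagindex, floor(extract(epoch FROM dateandtime) / 60 / 15)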

  • Are my indexes sufficient for a fast join? (they seem to leave out the time component)
  • Would I benefit from adding any other indexes?
  • What is the best/fastest query that accomplishes the above goals?

Here's an SQLFiddle containing the schema and some sample data: http://sqlfiddle.com/#!1/c7d2f/2

Schema:

  Table "public.conttagtable" (t) Column | Type | Modifiers -------------+---------+----------- tagname | text | tagindex | integer | tagtype | integer | tagdatatype | integer | Indexes: "tagindex" btree (tagindex) Table "public.contfloattable" (CF) Column | Type | Modifiers -------------+-----------------------------+----------- dateandtime | timestamp without time zone | millitm | integer | tagindex | integer | Val | double precision | status | text | marker | text | Indexes: "tagindex_contfloat" btree (tagindex) 

The result that I would like to see looks something like this:

 cf.dateandtime |cf."Val"|cf.status|t.tagname -------------------------------------------------- 2012-11-16 00:00:02 45 S SuperAlpha 2012-11-16 00:00:02 45 S SuperBeta 2012-11-16 00:00:02 45 S SuperGamma 2012-11-16 00:00:02 45 S SuperDelta 2012-11-16 00:15:02 45 S SuperAlpha 2012-11-16 00:15:02 45 S SuperBeta 2012-11-16 00:15:02 45 S SuperGamma 2012-11-16 00:15:02 45 S SuperDelta 2012-11-16 00:30:02 45 S SuperAlpha 2012-11-16 00:30:02 45 S SuperBeta 2012-11-16 00:30:02 45 S SuperGamma 2012-11-16 00:30:02 45 S SuperDelta 2012-11-16 00:45:02 42 S SuperAlpha 

... etc. etc.

As suggested by Clodoaldo, here is my latest attempt. Any suggestions for speeding it up?

    with i as (
        select cf.tagindex, min(dateandtime) dateandtime
        from contfloattable cf
        group by floor(extract(epoch from dateandtime) / 60 / 15), cf.tagindex
    )
    select
        cf.dateandtime,
        cf."Val",
        cf.status,
        t.tagname
    from contfloattable cf
    inner join conttagtable t on cf.tagindex = t.tagindex
    inner join i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
    order by floor(extract(epoch from cf.dateandtime) / 60 / 15), cf.tagindex

Query plan for the above: http://explain.depesz.com/s/loR

2 answers

For 15 minute intervals:

    with i as (
        select cf.tagindex, min(dateandtime) dateandtime
        from contfloattable cf
        group by floor(extract(epoch from dateandtime) / 60 / 15), cf.tagindex
    )
    select
        cf.dateandtime,
        cf."Val",
        cf.status,
        t.tagname
    from contfloattable cf
    inner join conttagtable t on cf.tagindex = t.tagindex
    inner join i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
    order by cf.dateandtime, t.tagname

Post the explain output for this query (if it works) so we can try to optimize it. You can post it in this answer.

Explain output

 "Sort (cost=15102462177.06..15263487805.24 rows=64410251271 width=57)" " Sort Key: cf.dateandtime, t.tagname" " CTE i" " -> HashAggregate (cost=49093252.56..49481978.32 rows=19436288 width=12)" " -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)" " -> Hash Join (cost=270117658.06..1067549320.69 rows=64410251271 width=57)" " Hash Cond: (cf.tagindex = t.tagindex)" " -> Merge Join (cost=270117116.39..298434544.23 rows=1408582784 width=25)" " Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))" " -> Sort (cost=2741707.02..2790297.74 rows=19436288 width=12)" " Sort Key: i.tagindex, i.dateandtime" " -> CTE Scan on i (cost=0.00..388725.76 rows=19436288 width=12)" " -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)" " -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)" " Sort Key: cf.tagindex, cf.dateandtime" " -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)" " -> Hash (cost=335.74..335.74 rows=16474 width=44)" " -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)" 

It looks like you need this index:

 create index cf_tag_datetime on contfloattable (tagindex, dateandtime) 

Run analyze after creating it. Also note that any index on a large table adds significant overhead to data changes (inserts etc.), since it has to be updated on every change.
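In other words, roughly this sequence (a sketch):

    CREATE INDEX cf_tag_datetime ON contfloattable (tagindex, dateandtime);
    ANALYZE contfloattable;  -- refresh planner statistics so the new index is costed properly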

Update

I added the cf_tag_datetime index (tagindex, dateandtime) and here is the new explain output:

 "Sort (cost=15349296514.90..15512953953.25 rows=65462975340 width=57)" " Sort Key: cf.dateandtime, t.tagname" " CTE i" " -> HashAggregate (cost=49093252.56..49490287.76 rows=19851760 width=12)" " -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)" " -> Hash Join (cost=270179293.86..1078141313.22 rows=65462975340 width=57)" " Hash Cond: (cf.tagindex = t.tagindex)" " -> Merge Join (cost=270178752.20..298499296.08 rows=1408582784 width=25)" " Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))" " -> Sort (cost=2803342.82..2852972.22 rows=19851760 width=12)" " Sort Key: i.tagindex, i.dateandtime" " -> CTE Scan on i (cost=0.00..397035.20 rows=19851760 width=12)" " -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)" " -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)" " Sort Key: cf.tagindex, cf.dateandtime" " -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)" " -> Hash (cost=335.74..335.74 rows=16474 width=44)" " -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)" 

It seems the cost has gone up :( However, if I remove the order by clause (not quite what I need, but it would work), this is what happens, a big reduction:

 "Hash Join (cost=319669581.62..1127631600.98 rows=65462975340 width=57)" " Hash Cond: (cf.tagindex = t.tagindex)" " CTE i" " -> HashAggregate (cost=49093252.56..49490287.76 rows=19851760 width=12)" " -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)" " -> Merge Join (cost=270178752.20..298499296.08 rows=1408582784 width=25)" " Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))" " -> Sort (cost=2803342.82..2852972.22 rows=19851760 width=12)" " Sort Key: i.tagindex, i.dateandtime" " -> CTE Scan on i (cost=0.00..397035.20 rows=19851760 width=12)" " -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)" " -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)" " Sort Key: cf.tagindex, cf.dateandtime" " -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)" " -> Hash (cost=335.74..335.74 rows=16474 width=44)" " -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)" 

I haven't tried this index yet... will do though. Standby.

Now, looking at it again, I think the reversed index could be even better, since it can be used not only in the Merge Join but also in the final Sort:

 create index cf_tag_datetime on contfloattable (dateandtime, tagindex) 

Here is another formulation. I will be quite interested to see how it scales on the complete dataset. Create this index first:

    CREATE INDEX contfloattable_tag_and_timeseg
    ON contfloattable (tagindex, (floor(extract(epoch FROM dateandtime) / 60 / 15)));

then run this with as much work_mem as you can afford:

    SELECT
      (first_value(x) OVER (PARTITION BY x.tagindex, floor(extract(epoch FROM x.dateandtime) / 60 / 15))).*,
      (SELECT t.tagname FROM conttagtable t WHERE t.tagindex = x.tagindex) AS tagname
    FROM contfloattable x
    ORDER BY dateandtime, tagname;

Sneaky Wombat: explain output for the above SQL on the complete dataset (without the suggested index): http://explain.depesz.com/s/kGo

Alternatively, here is one where only a single sequential pass is made through contfloattable, with the values collected into a tuplestore that is then JOINed against to get the tag names. It requires a lot of work_mem:

    SELECT
      cf.dateandtime, cf."Val", cf.status, t.tagname
    FROM (
      SELECT (first_value(x) OVER (PARTITION BY x.tagindex, floor(extract(epoch FROM x.dateandtime) / 60 / 15))).*
      FROM contfloattable x
    ) cf
    INNER JOIN conttagtable t ON cf.tagindex = t.tagindex
    ORDER BY cf.dateandtime, t.tagname;

Sneaky Wombat: explain output for the above SQL on the complete dataset (without the suggested index): http://explain.depesz.com/s/57q

If this works you will want to throw as much work_mem as you can afford at the query. You haven't mentioned your system's RAM, but you will want a decent chunk of it; try:

 SET work_mem = '500MB'; 

... or more if you have at least 4 GB of RAM and are on a 64-bit system. Again, I would be really interested to see how it works on the full data set.

By the way, for correctness of these queries I would advise ALTER TABLE conttagtable ADD PRIMARY KEY (tagindex); followed by DROP INDEX t_tagindex;. That will take a while, since a unique index has to be built. Most of the queries mentioned here assume that t.tagindex is unique in conttagtable, and that really should be enforced. The unique index can be used for additional optimizations that the old non-unique t_tagindex cannot, and it produces much better statistical estimates.
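A sketch of that migration, with a preliminary uniqueness check since tagindex was never verified to be unique (the index name is taken from the discussion above; substitute the actual name on your system):

    -- 1. Make sure there are no duplicate tagindex values before promoting it to a PK.
    SELECT tagindex, count(*)
    FROM conttagtable
    GROUP BY tagindex
    HAVING count(*) > 1;

    -- 2. If that returns no rows, add the primary key and drop the now-redundant index.
    ALTER TABLE conttagtable ADD PRIMARY KEY (tagindex);
    DROP INDEX t_tagindex;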

In addition, when comparing query plans, note that cost is not necessarily strictly proportional to real-world execution time. If the estimates are good then it should roughly correlate, but estimates are only estimates. Sometimes you will see a high-cost plan run faster than a supposedly low-cost plan because of things like bad row-count or index-selectivity estimates, limitations in the query planner's ability to infer relationships, unexpected correlations, or cost parameters like random_page_cost and seq_page_cost that do not match the real system.
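If you suspect those cost parameters are off for your hardware, this is roughly how you might check and experiment with them per-session (the values shown are illustrative only, not recommendations for this particular system):

    SHOW random_page_cost;       -- default 4.0, tuned for spinning disks with expensive random I/O
    SHOW seq_page_cost;          -- default 1.0
    SET random_page_cost = 2.0;  -- experiment per-session before changing postgresql.conf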

