The fastest way to get the closest data from multiple time-based tables

I have three tables with the following layout:

    TEMPERATURE_1 (time, zone (FK), temperature)
    TEMPERATURE_2 (time, zone (FK), temperature)
    TEMPERATURE_3 (time, zone (FK), temperature)

The data in each table is updated periodically, but not necessarily at the same time (i.e. time records are not identical).

I want to have access to the nearest reading from each table for every time, that is:

    TEMPERATURES (time, zone (FK), temperature_1, temperature_2, temperature_3)

In other words, for each unique time in my three tables, I need a row in the TEMPERATURES table, where each temperature_n value is the temperature reading closest in time from the corresponding source table.

At the moment, I have implemented this using two views:

    create view temptimes as
        select time, zone from temperature_1
        union
        select time, zone from temperature_2
        union
        select time, zone from temperature_3;

    create view temperatures as
        select tt.time,
               tt.zone,
               (select temperature from temperature_1
                 order by abs(timediff(time, tt.time)) limit 1) as temperature_1,
               (select temperature from temperature_2
                 order by abs(timediff(time, tt.time)) limit 1) as temperature_2,
               (select temperature from temperature_3
                 order by abs(timediff(time, tt.time)) limit 1) as temperature_3
        from temptimes as tt
        order by tt.time;

This approach works, but it is too slow to be used in production (even for small data sets of ~1000 records per temperature table, it takes minutes or more).

I am not good at SQL, so I am sure that I am missing the right way to do this. How do I approach the problem?

3 answers

The expensive part is the correlated subqueries: each one has to calculate the time difference for every single row of its temperature_* table just to find the one closest row for one column of one row in the main query.

It would be much faster if you could just pick one row after and one row before the current time using an index, and only calculate the time difference for these two candidates. All you need for this to be fast is an index on the time column of each table.
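As a minimal sketch of that idea for a single table and a single point in time (not the full solution): the index names and the probe value @probe are hypothetical, and the time column is assumed to be a DATETIME.

```sql
-- Assumed index names; one index on the time column of each table.
CREATE INDEX temperature_1_time_idx ON temperature_1 (time);
CREATE INDEX temperature_2_time_idx ON temperature_2 (time);
CREATE INDEX temperature_3_time_idx ON temperature_3 (time);

-- Hypothetical probe time, for illustration only.
SET @probe = '2013-01-01 12:00:00';

-- Fetch at most two candidates via the index (the closest row at or
-- before the probe and the closest row after it), then compute the
-- time difference only for those two rows.
SELECT c.time, c.temperature
FROM (
    (SELECT time, temperature
       FROM temperature_1
      WHERE time <= @probe
      ORDER BY time DESC
      LIMIT 1)
    UNION ALL
    (SELECT time, temperature
       FROM temperature_1
      WHERE time > @probe
      ORDER BY time
      LIMIT 1)
) AS c
ORDER BY ABS(TIMESTAMPDIFF(SECOND, c.time, @probe))
LIMIT 1;
```

Each inner SELECT can be answered with a short range scan on the index instead of sorting the whole table by distance.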

I ignore the zone column, as its role remains unclear in the question, and it just adds more noise to the core problem. It should be easy to add to the query.

Without further ado, this query does it all at once:

    SELECT time
          ,COALESCE(temp1
             ,CASE WHEN timediff(time, time1a) > timediff(time1b, time) THEN
                 (SELECT t.temperature FROM temperature_1 t WHERE t.time = y.time1b)
              ELSE
                 (SELECT t.temperature FROM temperature_1 t WHERE t.time = y.time1a)
              END) AS temp1
          ,COALESCE(temp2
             ,CASE WHEN timediff(time, time2a) > timediff(time2b, time) THEN
                 (SELECT t.temperature FROM temperature_2 t WHERE t.time = y.time2b)
              ELSE
                 (SELECT t.temperature FROM temperature_2 t WHERE t.time = y.time2a)
              END) AS temp2
          ,COALESCE(temp3
             ,CASE WHEN timediff(time, time3a) > timediff(time3b, time) THEN
                 (SELECT t.temperature FROM temperature_3 t WHERE t.time = y.time3b)
              ELSE
                 (SELECT t.temperature FROM temperature_3 t WHERE t.time = y.time3a)
              END) AS temp3
    FROM  (
       SELECT time
             ,max(t1) AS temp1
             ,max(t2) AS temp2
             ,max(t3) AS temp3
             ,CASE WHEN max(t1) IS NULL THEN
                 (SELECT t.time FROM temperature_1 t
                  WHERE t.time < x.time ORDER BY t.time DESC LIMIT 1)
              ELSE NULL END AS time1a
             ,CASE WHEN max(t1) IS NULL THEN
                 (SELECT t.time FROM temperature_1 t
                  WHERE t.time > x.time ORDER BY t.time LIMIT 1)
              ELSE NULL END AS time1b
             ,CASE WHEN max(t2) IS NULL THEN
                 (SELECT t.time FROM temperature_2 t
                  WHERE t.time < x.time ORDER BY t.time DESC LIMIT 1)
              ELSE NULL END AS time2a
             ,CASE WHEN max(t2) IS NULL THEN
                 (SELECT t.time FROM temperature_2 t
                  WHERE t.time > x.time ORDER BY t.time LIMIT 1)
              ELSE NULL END AS time2b
             ,CASE WHEN max(t3) IS NULL THEN
                 (SELECT t.time FROM temperature_3 t
                  WHERE t.time < x.time ORDER BY t.time DESC LIMIT 1)
              ELSE NULL END AS time3a
             ,CASE WHEN max(t3) IS NULL THEN
                 (SELECT t.time FROM temperature_3 t
                  WHERE t.time > x.time ORDER BY t.time LIMIT 1)
              ELSE NULL END AS time3b
       FROM (
          SELECT time, temperature AS t1, NULL AS t2, NULL AS t3 FROM temperature_1
          UNION ALL
          SELECT time, NULL AS t1, temperature AS t2, NULL AS t3 FROM temperature_2
          UNION ALL
          SELECT time, NULL AS t1, NULL AS t2, temperature AS t3 FROM temperature_3
          ) AS x
       GROUP BY time
       ) y
    ORDER  BY time;

→ sqlfiddle

Explanation

Subquery x replaces your temptimes view and brings the temperatures into the result. If all three tables are in sync and have a temperature for the same points in time, the rest is not even needed and the query is very fast.
For each point in time where one of the three tables has no row, the temperature is picked as instructed: take the "closest" one from that table.

Subquery y aggregates the rows from x and, for each table that has no temperature at the current time, fetches the previous time ( time1a ) and the next time ( time1b ) relative to the current time. These lookups should be fast using an index.

The final query selects the temperature from the row with the closest time for each temperature that is actually missing.

This query could be simpler if MySQL allowed referencing columns from more than one level above the current subquery, but it doesn't. It works fine in PostgreSQL: -> sqlfiddle

It would also be simpler if more than one column could be returned from a correlated subquery, but I don't know how to do it in MySQL.

And it would be much simpler with CTEs and window functions, but MySQL versions before 8.0 do not support these modern SQL features (unlike other relevant RDBMSs).


The reason this is slow is that it requires 3 table scans to compute and order the differences.

I assume that you already have indexes on the time and zone columns - at the moment they will not help because of the table scan problem.

There are many options to avoid this, depending on what you need and on how frequently the data is collected.

You have already said that data is collected periodically, but not simultaneously. This suggests several options.

  • At what level of precision do you need the time data - day, hour, minute, etc.? Store the time information only at that level of precision (or maintain it in another column) and run your queries against that.
  • If you know that the times in the 3 tables will be within a certain interval of each other (hour, day, etc.), add a WHERE clause to limit the calculation to the times that are potential candidates. You are effectively creating buckets, as in a histogram - you will need a calendar table for this.
  • Make the comparison unidirectional, i.e. only consider times after the time you are looking for, so if you are looking for 12:00:00, then 13:45:32 is a candidate, but 11:59:59 isn’t.

I understand what you are trying to accomplish - ask yourself why, and whether a simpler solution would meet your needs.
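The second option in the list (limiting the candidates to a known interval) could be sketched as follows. This is only an illustration under assumptions: the one-hour window is an arbitrary, hypothetical bound, time is assumed to be a DATETIME column, and temptimes is the view from the question. Only temperature_1 is shown; the other two tables would follow the same pattern.

```sql
-- Sketch: restrict each lookup to rows within an assumed +/- 1 hour
-- window, so that an index on temperature_1 (time) can narrow the
-- candidates before the differences are computed and ordered.
SELECT tt.time,
       (SELECT t.temperature
          FROM temperature_1 AS t
         WHERE t.time BETWEEN tt.time - INTERVAL 1 HOUR
                          AND tt.time + INTERVAL 1 HOUR
         ORDER BY ABS(TIMESTAMPDIFF(SECOND, t.time, tt.time))
         LIMIT 1) AS temperature_1
FROM temptimes AS tt
ORDER BY tt.time;
```

If no reading falls inside the window, the subquery returns NULL for that row, so the window has to be chosen to match how the data is actually collected.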


My suggestion is that you do not take the closest time, but instead take the most recent time at or before the given time. The reason for this is simple: typically, the data for a given point in time is whatever was known at that time. Including information from the future is generally not a good idea for most purposes.

With this change, you can modify your query to make use of an index on time. The problem with your current query is that wrapping the column in a function prevents the index from being used.

So, if you want the most recent temperature, use this for each variable:

    (select temperature
       from temperature_1 t2
      where t2.time <= tt.time
      order by t2.time desc
      limit 1
    ) as temperature_1,

Actually, you can also construct it like this:

    (select time
       from temperature_1 t2
      where t2.time <= tt.time
      order by t2.time desc
      limit 1
    ) as time_1,

And then join the temperature information back in. This will be efficient using an index.
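A rough sketch of that two-step variant for one table, assuming the temptimes view from the question and assuming that time is unique within temperature_1 (otherwise the join would return several rows per time):

```sql
-- Step 1 (derived table x): find the most recent reading time at or
-- before each tt.time, using the index on temperature_1 (time).
-- Step 2 (outer join): look up the temperature for that time.
SELECT x.time,
       t1.temperature AS temperature_1
FROM (
    SELECT tt.time,
           (SELECT t2.time
              FROM temperature_1 AS t2
             WHERE t2.time <= tt.time
             ORDER BY t2.time DESC
             LIMIT 1) AS time_1
    FROM temptimes AS tt
) AS x
LEFT JOIN temperature_1 AS t1
       ON t1.time = x.time_1
ORDER BY x.time;
```

The LEFT JOIN keeps rows for which no earlier reading exists; temperature_1 is simply NULL there.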

With that in mind, you could have two variables, time_1_before and time_1_after , for the closest time at or before and the closest time at or after. You can then use logic in the select to choose the closer of the two. Joining back to the temperature should be efficient using the index.

But, to repeat, I think the most recent temperature may be the best choice in this case.

