Our system has one table with the original raw data. This raw data is summarized into hourly, daily, and weekly intervals (sum, min, and max of the raw values for each interval).
We keep the raw data for 30 days (~4 weeks), hourly summaries for 43 days (~6 weeks), daily summaries for 560 days (~18 months), and weekly summaries for 10 years. Every night these four tables are purged and data older than the threshold is deleted. The hourly table has about 30M rows, the daily table about 18M. Some reports/charts use hourly data, but most use daily data. Occasionally we need to look at the raw data for a detailed investigation of a problem.
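A sketch of what that nightly purge could look like; the table and column names here are placeholders of mine, since the schema is only partially shown later:

    -- Nightly retention sketch; thresholds come from the text above,
    -- table and column names are placeholders, not the real schema.
    DELETE FROM dbo.StatsRaw  WHERE RawLocalDateTime             < DATEADD(day,  -30, GETDATE());
    DELETE FROM dbo.StatsHour WHERE PeriodBeginLocalDateTimeHour < DATEADD(day,  -43, GETDATE());
    DELETE FROM dbo.StatsDay  WHERE PeriodBeginLocalDateTimeDay  < DATEADD(day, -560, GETDATE());
    DELETE FROM dbo.StatsWeek WHERE PeriodBeginLocalDateTimeWeek < DATEADD(year, -10, GETDATE());

On tables of this size you would likely delete in smaller batches rather than one statement per table, to keep the transaction log and locking manageable.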
I have a dedicated application written in C++ that runs on a server 24/7, collects raw data from ~200 other servers, and inserts it into a central database. From inside the application I periodically (every 10 minutes) call a stored procedure that recalculates the summaries. This stored procedure can also be run by an end user at any time if the user wants to see the very latest data. A typical run takes about 10 seconds, so end users usually see slightly delayed summaries. The same procedure could instead be run by a scheduled task on the server every 10 minutes; doing it from the application gives me finer control over the other threads that collect data. Essentially, I suspend attempts to insert new data while the summarization is running, but the same effect can be achieved using independent stored procedures alone.
In my case I was able to make the recalculation quite efficient.
As new data arrives during each 10-minute window, I insert the raw rows directly into the main table. The raw data points are never updated; they are only inserted. So this step is simple and efficient. I use a stored procedure with a table-valued parameter and pass a whole chunk of new data in one call, so many rows are inserted with a single INSERT, which is good.
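For illustration, a minimal sketch of such a procedure; the type and table names (RawStatsRowList, StatsRaw) and the column names are my own placeholders, only the parameter name @ParamRows comes from the actual code shown later:

    -- Sketch: table type for a batch of raw rows; names are assumptions.
    CREATE TYPE dbo.RawStatsRowList AS TABLE
    (
        ID int NOT NULL
        ,RawLocalDateTime datetime NOT NULL
        ,RawValue int NOT NULL
    );
    GO

    CREATE PROCEDURE dbo.InsertRawStats
        @ParamRows dbo.RawStatsRowList READONLY
    AS
    BEGIN
        SET NOCOUNT ON;
        -- the whole batch goes in with a single INSERT
        INSERT INTO dbo.StatsRaw (ID, RawLocalDateTime, RawValue)
        SELECT ID, RawLocalDateTime, RawValue
        FROM @ParamRows;
    END;

One round trip per batch, instead of one per row, is what keeps this step cheap.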
The summary tables are brought up to date with the new data every 10 minutes by a second stored procedure. Some existing rows have to be updated, some rows are added. To do this efficiently I have a separate "staging" table with columns for the machine ID and the hourly, daily, and weekly timestamps. As I insert raw data into the main table, I also insert the affected machine IDs and time intervals into this staging table.
So, there are two main stored procedures. The application polls the ~200 remote servers using multiple threads, downloading fresh data from each server in an endless loop. As soon as a fresh batch of data is downloaded from some remote server, the first stored procedure is called. This happens often. This procedure inserts the batch of raw data into the raw table as is and inserts the list of affected time intervals into the staging table.
Let's say an incoming raw data batch looks like this:
    ID    timestamp              raw_value
    1     2015-01-01 23:54:45    123
    1     2015-01-01 23:57:12    456
    1     2015-01-02 00:03:23    789
    2     2015-01-02 02:05:21    909
4 rows are inserted into the main table (ID, timestamp, value).
3 rows are inserted into the staging table (usually there are many values with timestamps within the same hour, so while there are many raw rows, only a few end up in the staging table):
    ID    hourlytimestamp        dailytimestamp         weeklytimestamp
    1     2015-01-01 23:00:00    2015-01-01 00:00:00    2014-12-29 00:00:00
    1     2015-01-02 00:00:00    2015-01-02 00:00:00    2014-12-29 00:00:00
    2     2015-01-02 00:00:00    2015-01-02 00:00:00    2014-12-29 00:00:00
Note that here I collapse/condense/merge all IDs and timestamps into a unique set; the staging table holds no values at all, only the affected IDs and time intervals (StatsToRecalc is the staging table, @ParamRows is the table-valued parameter of the stored procedure that holds the batch of new rows):
    DECLARE @VarStart datetime = '20000103'; -- an arbitrary anchor date; it is a Monday, so weeks truncate to Mondays
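The body of that staging insert is not shown above; a minimal sketch of how it might continue, assuming the batch rows carry a timestamp column I'll call RawLocalDateTime. The DATEADD/DATEDIFF pairs truncate each timestamp to its hour, day, and week boundary relative to @VarStart:

    -- Sketch: dedupe the affected periods from the incoming batch
    -- and record them in the staging table.
    INSERT INTO dbo.StatsToRecalc
        (ID
        ,PeriodBeginLocalDateTimeHour
        ,PeriodBeginLocalDateTimeDay
        ,PeriodBeginLocalDateTimeWeek)
    SELECT DISTINCT
        TT.ID
        -- truncate the timestamp down to the hour boundary
        ,DATEADD(hour, DATEDIFF(hour, @VarStart, TT.RawLocalDateTime), @VarStart)
        -- truncate down to the day boundary
        ,DATEADD(day, DATEDIFF(day, @VarStart, TT.RawLocalDateTime), @VarStart)
        -- truncate down to the week boundary; since @VarStart is a Monday,
        -- integer-dividing the day difference by 7 lands on Mondays
        ,DATEADD(day, (DATEDIFF(day, @VarStart, TT.RawLocalDateTime) / 7) * 7, @VarStart)
    FROM @ParamRows AS TT;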
Then there is a simple INSERT into the raw table from @ParamRows.
So, within each 10-minute window many INSERTs run against the raw and staging tables, from many threads, through this procedure.
Every 10 minutes the second procedure is called to recalculate the summaries.
The first thing it does is start a transaction and lock the staging table until the end of the transaction:
    SELECT @VarCount = COUNT(*)
    FROM dbo.StatsToRecalc WITH (HOLDLOCK);
If the staging table StatsToRecalc is not empty, there is work to do. Since the table is locked, the worker threads will not interfere; they wait until the recalculation completes before adding more data.
Using this staging table I can quickly determine which hours, days, and weeks need recalculating, and for which IDs. The actual aggregation is done by MERGE statements, each of which processes all affected IDs and intervals at once. I run three MERGEs to roll the raw data up into the hourly summary, then hourly into daily, then daily into weekly. Then the staging table is emptied (every 10 minutes), so it never grows too big.
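Putting those steps together, a sketch of the second procedure's skeleton; the procedure name and the surrounding details are my assumptions:

    CREATE PROCEDURE dbo.RecalcSummaries
    AS
    BEGIN
        SET NOCOUNT ON;
        SET XACT_ABORT ON;
        DECLARE @VarCount int;

        BEGIN TRANSACTION;

        -- HOLDLOCK keeps the lock on the staging table until COMMIT,
        -- so inserting threads wait instead of adding rows mid-recalculation
        SELECT @VarCount = COUNT(*)
        FROM dbo.StatsToRecalc WITH (HOLDLOCK);

        IF @VarCount > 0
        BEGIN
            -- Three MERGE statements run here:
            --   raw    rows -> hourly summary (dbo.StatsHour)
            --   hourly rows -> daily  summary (dbo.StatsDay)
            --   daily  rows -> weekly summary (assumed dbo.StatsWeek)
            -- each driven by the IDs/periods recorded in the staging table.

            -- Finally, empty the staging table so it never grows large.
            DELETE FROM dbo.StatsToRecalc;
        END;

        COMMIT TRANSACTION;
    END;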
Each MERGE first builds a list of the IDs and timestamps that have been affected since the last recalculation (for example, for updating the daily table from the hourly table):
    WITH CTE_Changed (ID, PeriodBeginLocalDateTimeDay)
    AS
    (
        SELECT
            dbo.StatsToRecalc.ID
            ,dbo.StatsToRecalc.PeriodBeginLocalDateTimeDay
        FROM dbo.StatsToRecalc
        GROUP BY
            dbo.StatsToRecalc.ID
            ,dbo.StatsToRecalc.PeriodBeginLocalDateTimeDay
    )
and then joins the hourly table to this CTE inside the MERGE:
    MERGE INTO dbo.StatsDay AS Dest
    USING
    (
        SELECT ...
        FROM dbo.StatsHour
        INNER JOIN CTE_Changed
            ON  CTE_Changed.ID = dbo.StatsHour.ID
            AND CTE_Changed.PeriodBeginLocalDateTimeDay = dbo.StatsHour.PeriodBeginLocalDateTimeDay
    )
    ...
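Filled out, that daily MERGE might look like the sketch below; the aggregate column names (SumValue, MinValue, MaxValue) are assumptions of mine, since the actual select list is elided above:

    -- Sketch: roll the affected hourly rows up into daily rows.
    WITH CTE_Changed (ID, PeriodBeginLocalDateTimeDay)
    AS
    (
        SELECT
            dbo.StatsToRecalc.ID
            ,dbo.StatsToRecalc.PeriodBeginLocalDateTimeDay
        FROM dbo.StatsToRecalc
        GROUP BY
            dbo.StatsToRecalc.ID
            ,dbo.StatsToRecalc.PeriodBeginLocalDateTimeDay
    )
    MERGE INTO dbo.StatsDay AS Dest
    USING
    (
        -- aggregate the affected hourly rows up to day level
        SELECT
            StatsHour.ID
            ,StatsHour.PeriodBeginLocalDateTimeDay
            ,SUM(StatsHour.SumValue) AS SumValue
            ,MIN(StatsHour.MinValue) AS MinValue
            ,MAX(StatsHour.MaxValue) AS MaxValue
        FROM dbo.StatsHour
        INNER JOIN CTE_Changed
            ON  CTE_Changed.ID = StatsHour.ID
            AND CTE_Changed.PeriodBeginLocalDateTimeDay = StatsHour.PeriodBeginLocalDateTimeDay
        GROUP BY
            StatsHour.ID
            ,StatsHour.PeriodBeginLocalDateTimeDay
    ) AS Src
    ON  Dest.ID = Src.ID
    AND Dest.PeriodBeginLocalDateTimeDay = Src.PeriodBeginLocalDateTimeDay
    WHEN MATCHED THEN
        UPDATE SET
            SumValue = Src.SumValue
            ,MinValue = Src.MinValue
            ,MaxValue = Src.MaxValue
    WHEN NOT MATCHED THEN
        INSERT (ID, PeriodBeginLocalDateTimeDay, SumValue, MinValue, MaxValue)
        VALUES (Src.ID, Src.PeriodBeginLocalDateTimeDay, Src.SumValue, Src.MinValue, Src.MaxValue);

Because the USING side is restricted to the periods in the staging table, each run touches only the handful of days that actually changed, not the whole 18M-row daily table.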
To help with this multi-step summarization I have helper columns in the raw, hourly, and daily tables. For example, the hourly table has a column PeriodBeginLocalDateTimeHour that contains values like:
    2015-01-01 22:00:00
    2015-01-01 23:00:00
    2015-01-02 00:00:00
    2015-01-02 01:00:00
    ...
that is, hour boundaries. Alongside it there is a second column containing these timestamps "truncated" to the day boundary, PeriodBeginLocalDateTimeDay, which holds values like:
    2015-01-01 00:00:00
    2015-01-02 00:00:00
    ...
that is, day boundaries. The second column is used only when I roll hours up into days: I don't compute the day timestamp on the fly, but use pre-stored, indexed values instead.
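For concreteness, a sketch of how the hourly table could be laid out under this scheme; apart from the two period columns named above, the names and types are my assumptions:

    CREATE TABLE dbo.StatsHour
    (
        ID int NOT NULL
        ,PeriodBeginLocalDateTimeHour datetime NOT NULL -- hour boundary
        ,PeriodBeginLocalDateTimeDay  datetime NOT NULL -- same timestamp truncated to the day boundary
        ,SumValue bigint NOT NULL
        ,MinValue int NOT NULL
        ,MaxValue int NOT NULL
        ,CONSTRAINT PK_StatsHour PRIMARY KEY (ID, PeriodBeginLocalDateTimeHour)
    );

    -- the helper column is indexed so the hours-to-days rollup can seek on it
    CREATE INDEX IX_StatsHour_Day
        ON dbo.StatsHour (ID, PeriodBeginLocalDateTimeDay);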
I should add that in my case it is fine if the dedicated C++ application is down for a while. It just means that the data will be delayed by more than 10 minutes, but nothing will be lost.