Our system has one table with the original raw data. This raw data is summarized into hourly, daily, and weekly intervals (sum, min, and max of the raw values for each interval).
We keep the raw data for 30 days (~4 weeks), hourly summaries for 43 days (~6 weeks), daily summaries for 560 days (~18 months), and weekly summaries for 10 years. Every night these four tables are purged and data older than the threshold is deleted. The hourly table has about 30M rows, the daily table about 18M. Some reports/charts use hourly data, but most use daily data. Occasionally we need to look at the raw data for a detailed investigation of a problem.
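A sketch of what that nightly purge could look like; the table and column names here are placeholders of mine, since the schema is only partially shown later:

    -- Nightly retention sketch; thresholds come from the text above,
    -- table and column names are placeholders, not the real schema.
    DELETE FROM dbo.StatsRaw  WHERE RawLocalDateTime             < DATEADD(day,  -30, GETDATE());
    DELETE FROM dbo.StatsHour WHERE PeriodBeginLocalDateTimeHour < DATEADD(day,  -43, GETDATE());
    DELETE FROM dbo.StatsDay  WHERE PeriodBeginLocalDateTimeDay  < DATEADD(day, -560, GETDATE());
    DELETE FROM dbo.StatsWeek WHERE PeriodBeginLocalDateTimeWeek < DATEADD(year, -10, GETDATE());

On tables of this size you would likely delete in smaller batches rather than one statement per table, to keep the transaction log and locking manageable.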
I have a dedicated application written in C++ that runs on a server 24/7, collects raw data from ~200 other servers, and inserts it into a central database. From inside the application I periodically (every 10 minutes) call a stored procedure that recalculates the summaries. This stored procedure can also be run by an end user at any time if the user wants to see the very latest data. A typical run takes about 10 seconds, so end users usually see slightly delayed summaries. The same procedure could instead be run by a scheduled task on the server every 10 minutes; doing it from the application gives me finer control over the other threads that collect data. Essentially, I suspend attempts to insert new data while the summarization is running, but the same effect can be achieved using independent stored procedures alone.
In my case I was able to make the recalculation quite efficient.
As new data arrives during each 10-minute window, I insert the raw rows directly into the main table. The raw data points are never updated; they are only inserted. So this step is simple and efficient. I use a stored procedure with a table-valued parameter and pass a whole chunk of new data in one call, so many rows are inserted with a single INSERT, which is good.
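For illustration, a minimal sketch of such a procedure; the type and table names (RawStatsRowList, StatsRaw) and the column names are my own placeholders, only the parameter name @ParamRows comes from the actual code shown later:

    -- Sketch: table type for a batch of raw rows; names are assumptions.
    CREATE TYPE dbo.RawStatsRowList AS TABLE
    (
        ID int NOT NULL
        ,RawLocalDateTime datetime NOT NULL
        ,RawValue int NOT NULL
    );
    GO

    CREATE PROCEDURE dbo.InsertRawStats
        @ParamRows dbo.RawStatsRowList READONLY
    AS
    BEGIN
        SET NOCOUNT ON;
        -- the whole batch goes in with a single INSERT
        INSERT INTO dbo.StatsRaw (ID, RawLocalDateTime, RawValue)
        SELECT ID, RawLocalDateTime, RawValue
        FROM @ParamRows;
    END;

One round trip per batch, instead of one per row, is what keeps this step cheap.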
The summary tables are brought up to date with the new data every 10 minutes by a second stored procedure. Some existing rows have to be updated, some rows are added. To do this efficiently I have a separate "staging" table with columns for the machine ID and the hourly, daily, and weekly timestamps. As I insert raw data into the main table, I also insert the affected machine IDs and time intervals into this staging table.
So, there are two main stored procedures. The application polls the ~200 remote servers using multiple threads, downloading fresh data from each server in an endless loop. As soon as a fresh batch of data is downloaded from some remote server, the first stored procedure is called. This happens often. This procedure inserts the batch of raw data into the raw table as is and inserts the list of affected time intervals into the staging table.
Let's say an incoming raw data batch looks like this:
    ID    timestamp              raw_value
    1     2015-01-01 23:54:45    123
    1     2015-01-01 23:57:12    456
    1     2015-01-02 00:03:23    789
    2     2015-01-02 02:05:21    909
4 rows are inserted into the main table (ID, timestamp, value).
3 rows are inserted into the staging table (usually there are many values with timestamps within the same hour, so while there are many raw rows, only a few end up in the staging table):
    ID    hourlytimestamp        dailytimestamp         weeklytimestamp
    1     2015-01-01 23:00:00    2015-01-01 00:00:00    2014-12-29 00:00:00
    1     2015-01-02 00:00:00    2015-01-02 00:00:00    2014-12-29 00:00:00
    2     2015-01-02 00:00:00    2015-01-02 00:00:00    2014-12-29 00:00:00
Note that here I collapse/condense/merge all IDs and timestamps into a unique set; the staging table holds no values at all, only the affected IDs and time intervals (StatsToRecalc is the staging table, @ParamRows is the table-valued parameter of the stored procedure that holds the batch of new rows):
    DECLARE @VarStart datetime = '20000103'; -- an arbitrary anchor date; it is a Monday, so weeks truncate to Mondays
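The body of that staging insert is not shown above; a minimal sketch of how it might continue, assuming the batch rows carry a timestamp column I'll call RawLocalDateTime. The DATEADD/DATEDIFF pairs truncate each timestamp to its hour, day, and week boundary relative to @VarStart:

    -- Sketch: dedupe the affected periods from the incoming batch
    -- and record them in the staging table.
    INSERT INTO dbo.StatsToRecalc
        (ID
        ,PeriodBeginLocalDateTimeHour
        ,PeriodBeginLocalDateTimeDay
        ,PeriodBeginLocalDateTimeWeek)
    SELECT DISTINCT
        TT.ID
        -- truncate the timestamp down to the hour boundary
        ,DATEADD(hour, DATEDIFF(hour, @VarStart, TT.RawLocalDateTime), @VarStart)
        -- truncate down to the day boundary
        ,DATEADD(day, DATEDIFF(day, @VarStart, TT.RawLocalDateTime), @VarStart)
        -- truncate down to the week boundary; since @VarStart is a Monday,
        -- integer-dividing the day difference by 7 lands on Mondays
        ,DATEADD(day, (DATEDIFF(day, @VarStart, TT.RawLocalDateTime) / 7) * 7, @VarStart)
    FROM @ParamRows AS TT;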
Then there is a simple INSERT into the raw table from @ParamRows.
So, within each 10-minute window many INSERTs run against the raw and staging tables, from many threads, through this procedure.
Every 10 minutes the second procedure is called to recalculate the summaries.
The first thing it does is start a transaction and lock the staging table until the end of the transaction:
    SELECT @VarCount = COUNT(*)
    FROM dbo.StatsToRecalc WITH (HOLDLOCK);
If the staging table StatsToRecalc is not empty, there is work to do. Since the table is locked, the worker threads will not interfere; they wait until the recalculation completes before adding more data.
Using this staging table I can quickly determine which hours, days, and weeks need recalculating, and for which IDs. The actual aggregation is done by MERGE statements, each of which processes all affected IDs and intervals at once. I run three MERGEs to roll the raw data up into the hourly summary, then hourly into daily, then daily into weekly. Then the staging table is emptied (every 10 minutes), so it never grows too big.
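Putting those steps together, a sketch of the second procedure's skeleton; the procedure name and the surrounding details are my assumptions:

    CREATE PROCEDURE dbo.RecalcSummaries
    AS
    BEGIN
        SET NOCOUNT ON;
        SET XACT_ABORT ON;
        DECLARE @VarCount int;

        BEGIN TRANSACTION;

        -- HOLDLOCK keeps the lock on the staging table until COMMIT,
        -- so inserting threads wait instead of adding rows mid-recalculation
        SELECT @VarCount = COUNT(*)
        FROM dbo.StatsToRecalc WITH (HOLDLOCK);

        IF @VarCount > 0
        BEGIN
            -- Three MERGE statements run here:
            --   raw    rows -> hourly summary (dbo.StatsHour)
            --   hourly rows -> daily  summary (dbo.StatsDay)
            --   daily  rows -> weekly summary (assumed dbo.StatsWeek)
            -- each driven by the IDs/periods recorded in the staging table.

            -- Finally, empty the staging table so it never grows large.
            DELETE FROM dbo.StatsToRecalc;
        END;

        COMMIT TRANSACTION;
    END;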
Each MERGE first builds a list of the IDs and timestamps that have been affected since the last recalculation (for example, for updating the daily table from the hourly table):
    WITH CTE_Changed (ID, PeriodBeginLocalDateTimeDay)
    AS
    (
        SELECT
            dbo.StatsToRecalc.ID
            ,dbo.StatsToRecalc.PeriodBeginLocalDateTimeDay
        FROM dbo.StatsToRecalc
        GROUP BY
            dbo.StatsToRecalc.ID
            ,dbo.StatsToRecalc.PeriodBeginLocalDateTimeDay
    )
and then joins the hourly table to this CTE inside the MERGE:
    MERGE INTO dbo.StatsDay AS Dest
    USING
    (
        SELECT ...
        FROM dbo.StatsHour
        INNER JOIN CTE_Changed
            ON  CTE_Changed.ID = dbo.StatsHour.ID
            AND CTE_Changed.PeriodBeginLocalDateTimeDay = dbo.StatsHour.PeriodBeginLocalDateTimeDay
    )
    ...
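Filled out, that daily MERGE might look like the sketch below; the aggregate column names (SumValue, MinValue, MaxValue) are assumptions of mine, since the actual select list is elided above:

    -- Sketch: roll the affected hourly rows up into daily rows.
    WITH CTE_Changed (ID, PeriodBeginLocalDateTimeDay)
    AS
    (
        SELECT
            dbo.StatsToRecalc.ID
            ,dbo.StatsToRecalc.PeriodBeginLocalDateTimeDay
        FROM dbo.StatsToRecalc
        GROUP BY
            dbo.StatsToRecalc.ID
            ,dbo.StatsToRecalc.PeriodBeginLocalDateTimeDay
    )
    MERGE INTO dbo.StatsDay AS Dest
    USING
    (
        -- aggregate the affected hourly rows up to day level
        SELECT
            StatsHour.ID
            ,StatsHour.PeriodBeginLocalDateTimeDay
            ,SUM(StatsHour.SumValue) AS SumValue
            ,MIN(StatsHour.MinValue) AS MinValue
            ,MAX(StatsHour.MaxValue) AS MaxValue
        FROM dbo.StatsHour
        INNER JOIN CTE_Changed
            ON  CTE_Changed.ID = StatsHour.ID
            AND CTE_Changed.PeriodBeginLocalDateTimeDay = StatsHour.PeriodBeginLocalDateTimeDay
        GROUP BY
            StatsHour.ID
            ,StatsHour.PeriodBeginLocalDateTimeDay
    ) AS Src
    ON  Dest.ID = Src.ID
    AND Dest.PeriodBeginLocalDateTimeDay = Src.PeriodBeginLocalDateTimeDay
    WHEN MATCHED THEN
        UPDATE SET
            SumValue = Src.SumValue
            ,MinValue = Src.MinValue
            ,MaxValue = Src.MaxValue
    WHEN NOT MATCHED THEN
        INSERT (ID, PeriodBeginLocalDateTimeDay, SumValue, MinValue, MaxValue)
        VALUES (Src.ID, Src.PeriodBeginLocalDateTimeDay, Src.SumValue, Src.MinValue, Src.MaxValue);

Because the USING side is restricted to the periods in the staging table, each run touches only the handful of days that actually changed, not the whole 18M-row daily table.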
To help with this multi-step summarization I have helper columns in the raw, hourly, and daily tables. For example, the hourly table has a column PeriodBeginLocalDateTimeHour that contains values like:
    2015-01-01 22:00:00
    2015-01-01 23:00:00
    2015-01-02 00:00:00
    2015-01-02 01:00:00
    ...
that is, hour boundaries. Alongside it there is a second column containing these timestamps "truncated" to the day boundary, PeriodBeginLocalDateTimeDay, which holds values like:
    2015-01-01 00:00:00
    2015-01-02 00:00:00
    ...
that is, day boundaries. The second column is used only when I roll hours up into days: I don't compute the day timestamp on the fly, but use pre-stored, indexed values instead.
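For concreteness, a sketch of how the hourly table could be laid out under this scheme; apart from the two period columns named above, the names and types are my assumptions:

    CREATE TABLE dbo.StatsHour
    (
        ID int NOT NULL
        ,PeriodBeginLocalDateTimeHour datetime NOT NULL -- hour boundary
        ,PeriodBeginLocalDateTimeDay  datetime NOT NULL -- same timestamp truncated to the day boundary
        ,SumValue bigint NOT NULL
        ,MinValue int NOT NULL
        ,MaxValue int NOT NULL
        ,CONSTRAINT PK_StatsHour PRIMARY KEY (ID, PeriodBeginLocalDateTimeHour)
    );

    -- the helper column is indexed so the hours-to-days rollup can seek on it
    CREATE INDEX IX_StatsHour_Day
        ON dbo.StatsHour (ID, PeriodBeginLocalDateTimeDay);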
I should add that in my case it is fine if the dedicated C++ application is down for a while. It just means that the data will be delayed by more than 10 minutes, but nothing will be lost.