Aggregating only adjacent records with T-SQL

Question

I have (simplified for example) a table with the following data

    Row Start      Finish     ID Amount
    --- ---------- ---------- -- ------
      1 2008-10-01 2008-10-02 01     10
      2 2008-10-02 2008-10-03 02     20
      3 2008-10-03 2008-10-04 01     38
      4 2008-10-04 2008-10-05 01     23
      5 2008-10-05 2008-10-06 03     14
      6 2008-10-06 2008-10-07 02      3
      7 2008-10-07 2008-10-08 02      8
      8 2008-10-08 2008-11-08 03     19

The dates represent a period of time, the ID is the state the system was in during that period, and the amount is a value related to that state.

What I want to do is aggregate the amounts of adjacent rows that have the same ID, while keeping the overall sequence intact, so that each contiguous run collapses into a single row. So I want to end up with data like:

    Row Start      Finish     ID Amount
    --- ---------- ---------- -- ------
      1 2008-10-01 2008-10-02 01     10
      2 2008-10-02 2008-10-03 02     20
      3 2008-10-03 2008-10-05 01     61
      4 2008-10-05 2008-10-06 03     14
      5 2008-10-06 2008-10-08 02     11
      6 2008-10-08 2008-11-08 03     19

I am after a T-SQL solution that can be placed in a stored procedure, but I do not see how to do this with simple queries. I suspect that iteration may be required, but I would rather not go down that route.

The reason I want this aggregation is that the next step in the process is to run SUM() and COUNT() grouped by the unique IDs that occur in the sequence, so that my final data looks something like this:

    ID Counts Total
    -- ------ -----
    01      2    71
    02      2    31
    03      2    33

However, if I run a simple

    SELECT ID, COUNT(ID) AS Counts, SUM(Amount) AS Total FROM data GROUP BY ID

on the source table, I get something like

    ID Counts Total
    -- ------ -----
    01      3    71
    02      3    31
    03      2    33

This is not what I want.

sql tsql aggregate temporal-database

Peter M Oct 24 '08 at 10:08

4 answers

Jonathan Leffler · Answer 1 · 2008-10-25T05:31:23+0000

If you read R. T. Snodgrass's book "Developing Time-Oriented Database Applications in SQL" (a PDF of which is available from his website, under publications) and get as far as Figure 6.25 on pages 165-166, you will find non-trivial SQL that can be adapted to the current example to group rows that have the same ID value and contiguous time intervals.

The query development below is close to correct, but there is a problem, spotted right at the end, that has its source in the first SELECT statement. I have not yet tracked down why it gives the wrong answer. [If someone can test the SQL on their DBMS and tell me whether the first query works correctly there, that would be a great help!]

It looks something like this:

    -- Derived from Figure 6.25 from Snodgrass "Developing Time-Oriented
    -- Database Applications in SQL"
    CREATE TABLE Data
    (
        Start   DATE,
        Finish  DATE,
        ID      CHAR(2),
        Amount  INT
    );

    INSERT INTO Data VALUES('2008-10-01', '2008-10-02', '01', 10);
    INSERT INTO Data VALUES('2008-10-02', '2008-10-03', '02', 20);
    INSERT INTO Data VALUES('2008-10-03', '2008-10-04', '01', 38);
    INSERT INTO Data VALUES('2008-10-04', '2008-10-05', '01', 23);
    INSERT INTO Data VALUES('2008-10-05', '2008-10-06', '03', 14);
    INSERT INTO Data VALUES('2008-10-06', '2008-10-07', '02', 3);
    INSERT INTO Data VALUES('2008-10-07', '2008-10-08', '02', 8);
    INSERT INTO Data VALUES('2008-10-08', '2008-11-08', '03', 19);

    SELECT DISTINCT F.ID, F.Start, L.Finish
    FROM Data AS F, Data AS L
    WHERE F.Start < L.Finish
      AND F.ID = L.ID
      -- There are no gaps between F.Finish and L.Start
      AND NOT EXISTS (SELECT *
                      FROM Data AS M
                      WHERE M.ID = F.ID
                        AND F.Finish < M.Start
                        AND M.Start < L.Start
                        AND NOT EXISTS (SELECT *
                                        FROM Data AS T1
                                        WHERE T1.ID = F.ID
                                          AND T1.Start < M.Start
                                          AND M.Start <= T1.Finish))
      -- Cannot be extended further
      AND NOT EXISTS (SELECT *
                      FROM Data AS T2
                      WHERE T2.ID = F.ID
                        AND ((T2.Start < F.Start AND F.Start <= T2.Finish) OR
                             (T2.Start <= L.Finish AND L.Finish < T2.Finish)));

The result of this query:

    01 2008-10-01 2008-10-02
    01 2008-10-03 2008-10-05
    02 2008-10-02 2008-10-03
    02 2008-10-06 2008-10-08
    03 2008-10-05 2008-10-06
    03 2008-10-05 2008-11-08
    03 2008-10-08 2008-11-08

Edited: There is a problem with the penultimate row; it should not be there. And I do not (yet) know where it comes from.

Now we need to treat that complex expression as a query expression in the FROM clause of another SELECT statement, which sums the Amount values for a given ID over the entries that overlap with the maximal ranges shown above.

    SELECT M.ID, M.Start, M.Finish, SUM(D.Amount)
    FROM Data AS D,
         (SELECT DISTINCT F.ID, F.Start, L.Finish
          FROM Data AS F, Data AS L
          WHERE F.Start < L.Finish
            AND F.ID = L.ID
            -- There are no gaps between F.Finish and L.Start
            AND NOT EXISTS (SELECT *
                            FROM Data AS M
                            WHERE M.ID = F.ID
                              AND F.Finish < M.Start
                              AND M.Start < L.Start
                              AND NOT EXISTS (SELECT *
                                              FROM Data AS T1
                                              WHERE T1.ID = F.ID
                                                AND T1.Start < M.Start
                                                AND M.Start <= T1.Finish))
            -- Cannot be extended further
            AND NOT EXISTS (SELECT *
                            FROM Data AS T2
                            WHERE T2.ID = F.ID
                              AND ((T2.Start < F.Start AND F.Start <= T2.Finish) OR
                                   (T2.Start <= L.Finish AND L.Finish < T2.Finish)))) AS M
    WHERE D.ID = M.ID
      AND M.Start <= D.Start
      AND M.Finish >= D.Finish
    GROUP BY M.ID, M.Start, M.Finish
    ORDER BY M.ID, M.Start;

This gives:

    ID Start      Finish     Amount
    01 2008-10-01 2008-10-02     10
    01 2008-10-03 2008-10-05     61
    02 2008-10-02 2008-10-03     20
    02 2008-10-06 2008-10-08     11
    03 2008-10-05 2008-10-06     14
    03 2008-10-05 2008-11-08     33   -- Here be trouble!
    03 2008-10-08 2008-11-08     19

Edited: This is almost the correct data set on which to do the COUNT and SUM aggregation requested by the original question, so the final answer is:

    SELECT I.ID, COUNT(*) AS Number, SUM(I.Amount) AS Amount
    FROM (SELECT M.ID, M.Start, M.Finish, SUM(D.Amount) AS Amount
          FROM Data AS D,
               (SELECT DISTINCT F.ID, F.Start, L.Finish
                FROM Data AS F, Data AS L
                WHERE F.Start < L.Finish
                  AND F.ID = L.ID
                  -- There are no gaps between F.Finish and L.Start
                  AND NOT EXISTS (SELECT *
                                  FROM Data AS M
                                  WHERE M.ID = F.ID
                                    AND F.Finish < M.Start
                                    AND M.Start < L.Start
                                    AND NOT EXISTS (SELECT *
                                                    FROM Data AS T1
                                                    WHERE T1.ID = F.ID
                                                      AND T1.Start < M.Start
                                                      AND M.Start <= T1.Finish))
                  -- Cannot be extended further
                  AND NOT EXISTS (SELECT *
                                  FROM Data AS T2
                                  WHERE T2.ID = F.ID
                                    AND ((T2.Start < F.Start AND F.Start <= T2.Finish) OR
                                         (T2.Start <= L.Finish AND L.Finish < T2.Finish)))
               ) AS M
          WHERE D.ID = M.ID
            AND M.Start <= D.Start
            AND M.Finish >= D.Finish
          GROUP BY M.ID, M.Start, M.Finish
         ) AS I
    GROUP BY I.ID
    ORDER BY I.ID;

    id number amount
    01      2     71
    02      2     31
    03      3     66

Review: Ouch! Drat ... the entry for 03 has twice the amount it should have (and a count of 3 instead of 2). The "Edited" notes above mark where things started to go wrong. It looks as though either the first query is subtly wrong (maybe it is intended for a different question), or the optimizer I am working with is misbehaving. Nevertheless, there should be an answer closely related to this that gives the correct values.

For the record: tested on IBM Informix Dynamic Server 11.50 on Solaris 10. However, it should work fine on any other mid-sized SQL DBMS.
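One possible explanation for the stray 03 row, checked only by hand against the eight sample rows: the "no gaps" subquery looks for an uncovered row M that starts strictly before L.Start, so L itself is never tested. When F is the 03 row ending 2008-10-06 and L is the 03 row starting 2008-10-08, no 03 row starts between them, the test passes vacuously, and the pair survives. The sketch below changes that one comparison from `<` to `<=` so that L's own start must also be covered. The `#Data` temp table is a stand-in for the `Data` table above, and the `DATE` type and multi-row `VALUES` assume SQL Server 2008 or later.

```sql
-- Sketch only: #Data is a hypothetical stand-in for the Data table above.
CREATE TABLE #Data (Start DATE, Finish DATE, ID CHAR(2), Amount INT);

INSERT INTO #Data (Start, Finish, ID, Amount) VALUES
    ('2008-10-01', '2008-10-02', '01', 10),
    ('2008-10-02', '2008-10-03', '02', 20),
    ('2008-10-03', '2008-10-04', '01', 38),
    ('2008-10-04', '2008-10-05', '01', 23),
    ('2008-10-05', '2008-10-06', '03', 14),
    ('2008-10-06', '2008-10-07', '02', 3),
    ('2008-10-07', '2008-10-08', '02', 8),
    ('2008-10-08', '2008-11-08', '03', 19);

SELECT DISTINCT F.ID, F.Start, L.Finish
FROM #Data AS F, #Data AS L
WHERE F.Start < L.Finish
  AND F.ID = L.ID
  -- There are no gaps between F.Finish and L.Start
  AND NOT EXISTS (SELECT *
                  FROM #Data AS M
                  WHERE M.ID = F.ID
                    AND F.Finish < M.Start
                    AND M.Start <= L.Start   -- was "<": L's own start is now tested too
                    AND NOT EXISTS (SELECT *
                                    FROM #Data AS T1
                                    WHERE T1.ID = F.ID
                                      AND T1.Start < M.Start
                                      AND M.Start <= T1.Finish))
  -- Cannot be extended further (unchanged)
  AND NOT EXISTS (SELECT *
                  FROM #Data AS T2
                  WHERE T2.ID = F.ID
                    AND ((T2.Start < F.Start AND F.Start <= T2.Finish) OR
                         (T2.Start <= L.Finish AND L.Finish < T2.Finish)))
ORDER BY F.Start;

DROP TABLE #Data;
```

Traced by hand, this should leave six ranges (two per ID), with the 2008-10-05 to 2008-11-08 row for 03 gone. If that reading is right, the same one-character change would apply to the copies of this subquery inside the two larger statements. It has not been run against Informix or SQL Server here.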

tvanfosson · Answer 2 · 2008-10-24T22:11:47+0000

You would probably need to create a cursor and loop through the results, keeping track of which ID you are working with and accumulating the data along the way. When the ID changes, you can insert the accumulated data into a temporary table and return that table at the end of the procedure (select everything from it). A table-valued function might be better, since you can then simply insert into the return table as you go.
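The cursor approach described above might look roughly like the following. This is a sketch under stated assumptions rather than code from the thread: the table variable `@Data` stands in for the real table, the cursor and variable names are made up for illustration, rows are assumed to be processed in `Start` order with non-NULL IDs, and the `DATE` type and multi-row `VALUES` assume SQL Server 2008 or later.

```sql
-- Sketch only: @Data is a hypothetical stand-in for the question's table.
DECLARE @Data TABLE (Start DATE, Finish DATE, ID CHAR(2), Amount INT);
INSERT INTO @Data (Start, Finish, ID, Amount) VALUES
    ('2008-10-01', '2008-10-02', '01', 10),
    ('2008-10-02', '2008-10-03', '02', 20),
    ('2008-10-03', '2008-10-04', '01', 38),
    ('2008-10-04', '2008-10-05', '01', 23),
    ('2008-10-05', '2008-10-06', '03', 14),
    ('2008-10-06', '2008-10-07', '02', 3),
    ('2008-10-07', '2008-10-08', '02', 8),
    ('2008-10-08', '2008-11-08', '03', 19);

-- One row per contiguous run of the same ID.
DECLARE @Result TABLE (Start DATE, Finish DATE, ID CHAR(2), Amount INT);

DECLARE @Start DATE, @Finish DATE, @ID CHAR(2), @Amount INT;
DECLARE @RunStart DATE, @RunFinish DATE, @RunID CHAR(2), @RunAmount INT;

DECLARE RunCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT Start, Finish, ID, Amount FROM @Data ORDER BY Start;

OPEN RunCursor;
FETCH NEXT FROM RunCursor INTO @Start, @Finish, @ID, @Amount;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @RunID = @ID AND @RunFinish = @Start
    BEGIN
        -- Same ID and contiguous: extend the current run.
        SET @RunFinish = @Finish;
        SET @RunAmount = @RunAmount + @Amount;
    END
    ELSE
    BEGIN
        -- ID changed (or this is the first row): store the finished run, if any.
        IF @RunID IS NOT NULL
            INSERT INTO @Result (Start, Finish, ID, Amount)
            VALUES (@RunStart, @RunFinish, @RunID, @RunAmount);
        -- Start a new run from the current row.
        SELECT @RunStart = @Start, @RunFinish = @Finish,
               @RunID = @ID, @RunAmount = @Amount;
    END

    FETCH NEXT FROM RunCursor INTO @Start, @Finish, @ID, @Amount;
END

-- Store the last run, which the loop never flushes.
IF @RunID IS NOT NULL
    INSERT INTO @Result (Start, Finish, ID, Amount)
    VALUES (@RunStart, @RunFinish, @RunID, @RunAmount);

CLOSE RunCursor;
DEALLOCATE RunCursor;

SELECT ID, COUNT(*) AS Counts, SUM(Amount) AS Total
FROM @Result
GROUP BY ID
ORDER BY ID;
```

Traced by hand over the sample rows, `@Result` ends up with six runs, and the final SELECT should give a count of 2 for each ID with totals of 71, 31 and 33, matching the output the question asks for. The final flush after the loop is easy to forget; without it the last run is silently dropped.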

Dave_h · Answer 3 · 2008-10-25T02:19:22+0000

I suspect that iteration may be required, but I do not want to go that route.

I think that is the route you will have to take: use a cursor to populate a table variable. If you have a large number of records, you could use a permanent table to store the results; then, when you need to retrieve the data, you only have to process the new rows.

I would add a bit field with a default of 0 to the source table to keep track of which records have been processed. Assuming no one is using select * against the table, adding a column with a default value will not affect the rest of your application.

Add a comment to this post if you would like help coding the solution.

Peter M · Answer 4 · 2008-10-27T16:45:35+0000

Well, I decided to go down the iterative route, using a mixture of joins and cursors. By joining the Data table to itself, I can build a list of links for only those records that are consecutive.

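    INSERT INTO #CONSEC
        -- a.Finish (equal to b.Start) identifies the successor row that the
        -- cursor below deletes via "WHERE Start = @Finish"
        SELECT a.ID, a.Start, a.Finish, b.Amount
        FROM Data a
        JOIN Data b ON (a.Finish = b.Start) AND (a.ID = b.ID)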

Then I can unwind the list by iterating over it with a cursor, applying updates back to the Data table to adjust the amounts (and deleting the now-extraneous records from the Data table).

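    -- Variable types are assumptions; match them to the columns of Data.
    DECLARE @ID CHAR(2), @Start DATE, @Finish DATE, @Amount INT;
    DECLARE @ID_Last CHAR(2), @Start_Last DATE, @Finish_Last DATE;
    DECLARE @Total INT;

    DECLARE CCursor CURSOR FOR
        SELECT ID, Start, Finish, Amount
        FROM #CONSEC
        ORDER BY Start DESC;

    SET @Total = 0;

    OPEN CCursor;
    FETCH NEXT FROM CCursor INTO @ID, @Start, @Finish, @Amount;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @Total = @Total + @Amount;
        SET @Start_Last = @Start;
        SET @Finish_Last = @Finish;
        SET @ID_Last = @ID;

        -- Remove the row that is being folded into its predecessor.
        DELETE FROM Data WHERE Start = @Finish;

        FETCH NEXT FROM CCursor INTO @ID, @Start, @Finish, @Amount;

        -- End of a run: apply the accumulated total to the run's first row.
        IF (@ID_Last <> @ID) OR (@Finish <> @Start_Last)
        BEGIN
            UPDATE Data SET Amount = Amount + @Total WHERE Start = @Start_Last;
            SET @Total = 0;
        END
    END

    CLOSE CCursor;
    DEALLOCATE CCursor;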

It all works and has acceptable performance for the typical data that I use.

I did hit one small problem while developing the code above. Originally I updated the Data table on every pass through the cursor loop, but that did not work. It seemed that only one update per record took effect, and that repeated updates (to keep adding to the amount) kept reading the original contents of the record.
