Group data by the change of grouping column value in order

With the following data

    create table #ph (product int, [date] date, price int)

    insert into #ph select 1, '20120101', 1
    insert into #ph select 1, '20120102', 1
    insert into #ph select 1, '20120103', 1
    insert into #ph select 1, '20120104', 1
    insert into #ph select 1, '20120105', 2
    insert into #ph select 1, '20120106', 2
    insert into #ph select 1, '20120107', 2
    insert into #ph select 1, '20120108', 2
    insert into #ph select 1, '20120109', 1
    insert into #ph select 1, '20120110', 1
    insert into #ph select 1, '20120111', 1
    insert into #ph select 1, '20120112', 1

I would like to create the following output:

    product | date_from | date_to  | price
    1       | 20120101  | 20120105 | 1
    1       | 20120105  | 20120109 | 2
    1       | 20120109  | 20120112 | 1

If I group by price and show the min and max date, I get the following, which is not what I want (note how the date ranges overlap).

    product | date_from | date_to  | price
    1       | 20120101  | 20120112 | 1
    1       | 20120105  | 20120108 | 2

So essentially, what I want to do is group by the step change in the data, based on the grouping columns product and price.

What is the cleanest way to achieve this?

+8
sql tsql sql-server-2008 gaps-and-islands
4 answers

There is a (more or less) well-known technique for solving this kind of problem, which involves two ROW_NUMBER() calls, like this:

    WITH marked AS (
      SELECT
        *,
        grp = ROW_NUMBER() OVER (PARTITION BY product        ORDER BY date)
            - ROW_NUMBER() OVER (PARTITION BY product, price ORDER BY date)
      FROM #ph
    )
    SELECT
      product,
      date_from = MIN(date),
      date_to   = MAX(date),
      price
    FROM marked
    GROUP BY
      product,
      price,
      grp
    ORDER BY
      product,
      MIN(date)

Output:

    product  date_from   date_to     price
    -------  ----------  ----------  -----
    1        2012-01-01  2012-01-04  1
    1        2012-01-05  2012-01-08  2
    1        2012-01-09  2012-01-12  1
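
To see why the difference of the two row numbers identifies each run, it may help to look at the intermediate values. The following is only an illustrative sketch against the `#ph` table from the question; the column aliases `rn_all` and `rn_price` are names made up here and do not appear in the query above.

```sql
-- Sketch: expose the two row numbers whose difference forms grp.
-- rn_all and rn_price are illustrative aliases, not part of the original answer.
SELECT
  product,
  [date],
  price,
  -- position of the row among all rows of the product
  rn_all   = ROW_NUMBER() OVER (PARTITION BY product        ORDER BY [date]),
  -- position of the row among rows of the same product AND price
  rn_price = ROW_NUMBER() OVER (PARTITION BY product, price ORDER BY [date]),
  -- constant within a run of consecutive rows that share a price
  grp      = ROW_NUMBER() OVER (PARTITION BY product        ORDER BY [date])
           - ROW_NUMBER() OVER (PARTITION BY product, price ORDER BY [date])
FROM #ph
ORDER BY product, [date];
```

With the sample data, the first run (price 1) gets grp = 0, the run with price 2 gets grp = 4 (rows 5 to 8 overall, rows 1 to 4 within that price), and the last run (price 1) also gets grp = 4 (rows 9 to 12 overall, rows 5 to 8 within that price). Since two different runs can share the same grp value, grp alone is not enough, which is why the grouping is on product, price and grp together.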
+18

I am new to this forum, so I hope that my contribution will be useful.

If you really don't want to use a CTE (although I think that is probably the best approach), you can get a solution using set-based code. You will need to test the performance of this code!

I have added an extra temp table so that I have a unique identifier for each record, but I suspect you will already have such a column in your source table. So here is the temp table.

    -- Drop the temp table if it is left over from a previous run in this session
    IF OBJECT_ID('tempdb..#phwithId') IS NOT NULL
        DROP TABLE #phwithId

    CREATE TABLE #phwithId
    (
        SaleId    INT
      , ProductID INT
      , Price     MONEY
      , SaleDate  DATE
    )

    -- Number the rows of each product in date order
    INSERT INTO #phwithId
    SELECT  ROW_NUMBER() OVER (PARTITION BY product ORDER BY [date] ASC) AS SaleId
          , product
          , price
          , [date]
    FROM #ph

Now the main body of the SELECT statement:

    SELECT  da1.ProductId
          , da1.date_from
          , da2.date_to
          , da1.Price
    FROM (
        -- Bucket starts: rows whose previous row has a different price (or no previous row)
        SELECT  dfr.ProductId
              , ROW_NUMBER() OVER (PARTITION BY dfr.ProductId ORDER BY dfr.ChangeDate) AS rowno1
              , dfr.ChangeDate AS date_from
              , dfr.Price
        FROM (
            SELECT  sl1.ProductId AS ProductId
                  , sl1.SaleDate  AS ChangeDate
                  , sl1.Price
            FROM #phwithId sl1
            LEFT JOIN #phwithId sl2
                   ON sl1.ProductId = sl2.ProductId   -- SaleId restarts per product
                  AND sl1.SaleId = sl2.SaleId + 1
            WHERE sl1.Price <> sl2.Price OR sl2.Price IS NULL
        ) dfr
    ) da1
    LEFT JOIN (
        -- Bucket ends: rows whose next row has a different price (or no next row)
        SELECT  dto.ProductId
              , ROW_NUMBER() OVER (PARTITION BY dto.ProductId ORDER BY dto.ChangeDate) AS rowno2
              , dto.ChangeDate AS date_to
        FROM (
            SELECT  sl1.ProductId
                  , sl1.SaleDate AS ChangeDate
            FROM #phwithId sl1
            LEFT JOIN #phwithId sl3
                   ON sl1.ProductId = sl3.ProductId
                  AND sl1.SaleId = sl3.SaleId - 1
            WHERE sl1.Price <> sl3.Price OR sl3.Price IS NULL
        ) dto
    ) da2
           ON da1.ProductId = da2.ProductId
          AND da1.rowno1 = da2.rowno2

By joining the data source to itself offset by 1 record (+ or -), we can identify where the price buckets change, and then it is just a matter of getting the start and end dates of each bucket returned in a single record.

It's all a bit fiddly, and I'm not sure it will give the best performance, but I enjoyed the challenge.
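
The offset-by-one comparison can also be looked at in isolation. The following is a minimal sketch, not the code from this answer: it works directly on the `#ph` table from the question, and the names `numbered`, `rn`, `cur` and `prev` are made up for illustration.

```sql
-- Sketch: list the rows where a new price bucket starts.
-- numbered, rn, cur and prev are illustrative names only.
WITH numbered AS (
  SELECT product, [date], price,
         rn = ROW_NUMBER() OVER (PARTITION BY product ORDER BY [date])
  FROM #ph
)
SELECT cur.product,
       cur.[date] AS date_from,
       cur.price
FROM numbered AS cur
LEFT JOIN numbered AS prev
       ON prev.product = cur.product
      AND prev.rn = cur.rn - 1          -- the previous row of the same product
WHERE prev.price IS NULL                -- first row of the product
   OR prev.price <> cur.price           -- price differs from the previous row
ORDER BY cur.product, cur.[date];
```

With the sample data this returns the rows dated 2012-01-01, 2012-01-05 and 2012-01-09. Swapping `cur.rn - 1` for `cur.rn + 1` compares each row with the next one instead, which gives the last row of each bucket.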

+2

One solution I came up with that is relatively "clean" is:

    ;with cte_sort (product, [date], price, [row])
    as
        (select product, [date], price,
                row_number() over(partition by product order by [date] asc) as row
         from #ph)

    select a.product, a.[date] as date_from, c.[date] as date_to, a.price
    from cte_sort a
    left outer join cte_sort b
        on a.product = b.product and (a.row+1) = b.row and a.price = b.price
    outer apply (select top 1 [date]
                 from cte_sort z
                 where z.product = a.product and z.row > a.row
                 order by z.row) c
    where b.row is null
    order by a.[date]

I used a CTE with row_number because then you do not need to worry about missing dates, as you would if you used functions like dateadd. You obviously only need the outer apply if you want the date_to column (which I do).

This solution does solve my problem, but I am having some trouble getting it to run as fast as I would like on my 5 million row table.

0

    Create function [dbo].[AF_TableColumns](@table_name nvarchar(55))
    returns nvarchar(4000)
    as
    begin
        declare @str nvarchar(4000)

        select @str = cast(rtrim(ltrim(column_name)) as nvarchar(500)) + coalesce(' ' + @str, ' ')
        from information_schema.columns
        where table_name = @table_name
        group by table_name, column_name, ordinal_position
        order by ordinal_position DESC

        return @str
    end

    --select dbo.AF_TableColumns('YourTable')
    Select * from YourTable
-1
