T-SQL: Best Moving Distribution / Query Function

I need a T-SQL ranking approach similar to NTILE(), except that the tiles should follow a sliding distribution, so that earlier tiles have fewer members.

For example:

    CREATE TABLE #Rank_Table
    (
        id int identity(1,1) not null,
        hits bigint not null default 0,
        PERCENTILE smallint null
    )

    --Slant the distribution of the data
    INSERT INTO #Rank_Table (hits)
    select CASE
               when DATA > 9500 THEN DATA*30
               WHEN data > 8000 THEN DATA*5
               WHEN data < 7000 THEN DATA/3 +1
               ELSE DATA
           END
    FROM (select top 10000
                 (ABS(CHECKSUM(NewId())) % 99 +1) * (ABS(CHECKSUM(NewId())) % 99 +1) DATA
          from master..spt_values t1
          cross JOIN master..spt_values t2) exponential

    Declare @hitsPerGroup as bigint
    Declare @numGroups as smallint
    set @numGroups=100

    select @hitsPerGroup=SUM(hits)/(@numGroups -1)
    FROM #Rank_Table

    select @hitsPerGroup HITS_PER_GROUP

    --This is an even distribution
    SELECT id, HITS, NTILE(@numGroups) Over (Order By HITS DESC) PERCENTILE
    FROM #Rank_Table
    GROUP by id, HITS

    --This is my best attempt, but it skips groups because of the erratic distribution
    select T1.ID, T1.hits, T.RunningTotal/@hitsPerGroup + 1 TILE, T.RunningTotal
    FROM #Rank_Table T1
    CROSS APPLY (Select SUM(hits) RunningTotal FROM #Rank_Table where hits <= T1.hits) T
    order by T1.hits

    DROP TABLE #Rank_Table

In #Rank_Table, NTILE(@numGroups) creates an even distribution of @numGroups groups. I need @numGroups groups in which tile 1 has the fewest members, tile 2 has at least one more than tile 1, tile 3 has at least one more than tile 2, and so on, so that tile 100 has the most.

I use SQL Server 2008. In practice, this will run against a permanent table with potentially millions of rows to periodically update the PERCENTILE column with a percentile from 1 to 100.

My best attempt above skips percentiles and performs poorly. There must be a better way.

+6
tsql statistics sql-server-2008 tile
2 answers

To create a more linear distribution, I added a persisted computed column to the data table: HITS_SQRT AS (CONVERT([int],sqrt(HITS*4),(0))) PERSISTED.

Using this column, you can calculate the target number of hits per percentile.

 select @hitsPerGroup=SUM(HITS_SQRT)/(@numGroups -1) -@numGroups , @dataPoints=COUNT(*) FROM #Rank_Table 
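As a rough illustration of what the computed column and the target calculation do, here is a small Python sketch (not T-SQL). The function names `hits_sqrt` and `hits_per_group` are hypothetical, and the sample values in the note below are made up. It assumes the float-to-int conversion truncates and that the division is integer division, as in the T-SQL expressions above.

```python
import math

def hits_sqrt(hits):
    # Mirrors CONVERT(int, sqrt(HITS*4)): take the square root of 4*hits
    # and truncate the fractional part.
    return int(math.sqrt(hits * 4))

def hits_per_group(hits_list, num_groups=100):
    # Mirrors SUM(HITS_SQRT)/(@numGroups - 1) - @numGroups,
    # using integer division on the summed transformed values.
    total = sum(hits_sqrt(h) for h in hits_list)
    return total // (num_groups - 1) - num_groups

if __name__ == "__main__":
    sample = [1, 100, 10000, 250000]      # made-up hit counts
    print([hits_sqrt(h) for h in sample])  # transformed values
    print(hits_per_group(sample, num_groups=3))
```

With these sample values, raw hits of 1 and 250000 differ by a factor of 250000, while their transformed values 2 and 1000 differ by a factor of 500. That compression is what makes the running total cross the per-group threshold at a steadier pace.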

The script then loads a temporary table with a ROW_NUMBER() ordered by the number of hits and walks the rows one at a time in row-number order, assigning percentiles from 100 down to 1. It keeps a running total of HITS_SQRT, and each time that total reaches @hitsPerGroup, the percentile drops by one (from 100 to 99, from 99 to 98, and so on) and the total resets to zero.

Finally, the source data table is updated with the computed percentiles. To speed this up, the temporary work table has a clustered index on row.
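The threshold walk can be sketched outside the database as follows. This is a minimal Python illustration of the loop logic, not the T-SQL itself; `assign_percentiles` and its parameter names are hypothetical, and `weights` stands for the HITS_SQRT values already sorted in ascending order of hits.

```python
def assign_percentiles(weights, num_groups, hits_per_group):
    """Assign a percentile to each row, walking rows in ascending order.

    weights: transformed hit values, sorted ascending (one per row).
    Returns a list of percentiles, starting at num_groups and counting down.
    """
    percentiles = [None] * len(weights)
    percentile = num_groups
    running_total = 0
    start = 0  # index of the first row that has no percentile yet

    for i, w in enumerate(weights):
        running_total += w
        if running_total >= hits_per_group:
            # Threshold reached: every unassigned row before the current
            # one gets the current percentile. The current row itself is
            # left for the next group, and the total restarts at zero,
            # as in the script's loop.
            for j in range(start, i):
                percentiles[j] = percentile
            start = i
            percentile -= 1
            running_total = 0

    # Remaining rows get whatever percentile is current at the end.
    for j in range(start, len(weights)):
        percentiles[j] = percentile
    return percentiles

if __name__ == "__main__":
    print(assign_percentiles([1, 1, 1, 1, 2, 2, 3, 5], 4, 4))
```

For the made-up input above this prints `[4, 4, 4, 3, 3, 2, 2, 1]`: the group of low-hit rows is the largest, and each later group is the same size or smaller. Nothing in this logic stops the percentile from dropping below 1 if the threshold is crossed more than `num_groups - 1` times, so the choice of `hits_per_group` matters.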

Here is the full script, using #Rank_Table as the source data table:

    --Create Test Data
    CREATE TABLE #Rank_Table
    (
        id int identity(1,1) not null,
        hits bigint not null default 0,
        PERCENTILE smallint NULL,
        HITS_SQRT AS (CONVERT([int],sqrt(HITS*4),(0))) PERSISTED
    )

    --Slant the distribution of the data
    INSERT INTO #Rank_Table (hits)
    select CASE
               when DATA > 9500 THEN DATA*30
               WHEN data > 8000 THEN DATA*5
               WHEN data < 7000 THEN DATA/3 +1
               ELSE DATA
           END
    FROM (select top 10000
                 (ABS(CHECKSUM(NewId())) % 99 +1) * (ABS(CHECKSUM(NewId())) % 99 +1) DATA
          from master..spt_values t1
          cross JOIN master..spt_values t2) exponential

    --Create temp work table and variables to calculate percentiles
    Declare @hitsPerGroup as int
    Declare @numGroups as int
    Declare @dataPoints as int
    set @numGroups=100

    select @hitsPerGroup=SUM(HITS_SQRT)/(@numGroups -1) -@numGroups,
           @dataPoints=COUNT(*)
    FROM #Rank_Table

    --show the number of hits that each group should have
    select @hitsPerGroup HITS_PER_GROUP

    --Use temp table for the calculation
    CREATE TABLE #tbl
    (
        row int,
        hits int,
        ID bigint,
        PERCENTILE smallint null
    )

    --add index to row
    CREATE CLUSTERED INDEX idxRow ON #tbl(row)

    insert INTO #tbl
    select ROW_NUMBER() over (ORDER BY HITS), hits_SQRT, ID, null
    from #Rank_Table

    --Update each row with a running total.
    --lower the percentile by one when we cross a threshold for the maximum number of hits per group (@hitsPerGroup)
    DECLARE @row as int
    DEClare @runningTotal as int
    declare @percentile int
    set @row = 0
    set @runningTotal = 0
    set @percentile = @numGroups

    while @row <= @dataPoints
    BEGIN
        select @runningTotal=@runningTotal + hits from #tbl where row=@row

        if @runningTotal >= @hitsPerGroup
        BEGIN
            update #tbl set PERCENTILE=@percentile WHERE PERCENTILE is null and row <@row
            set @percentile = @percentile - 1
            set @runningTotal = 0
        END

        --change rows
        set @row = @row + 1
    END

    --get remaining
    update #tbl set PERCENTILE=@percentile WHERE PERCENTILE is null

    --update source data
    UPDATE m
    SET PERCENTILE = t.PERCENTILE
    FROM #tbl t
    inner join #Rank_Table m on t.ID=m.ID

    --Show the results
    SELECT PERCENTILE, COUNT(id) NUMBER_RECORDS, SUM(HITS) HITS_IN_PERCENTILE
    FROM #Rank_Table
    GROUP BY PERCENTILE
    ORDER BY PERCENTILE

    --cleanup
    DROP TABLE #Rank_Table
    DROP TABLE #tbl

Performance is not stellar, but it achieves the goal of a smooth sliding distribution.
